AlphaFold2 for Antibody Design: Real Limitations You Need to Know

AlphaFold2 is a genuine advance. For the global protein structure prediction problem — folding a protein with unknown structure from its sequence — it represents a decade's progress compressed into a single model release. We use it extensively at Genolux: every antibody program that comes to us without an experimental Fv structure gets an AlphaFold2 model as the first step of the pipeline. That's not in question.

What is in question is the precision with which AlphaFold2 predicts CDR loops, particularly CDR H3, and how that precision gap affects downstream ΔΔG calculations and docking. The field has moved fast enough that there's a risk of assuming AlphaFold2 has solved the antibody structure problem in the same way it solved the framework fold problem — and it hasn't. The failure modes are specific, predictable, and worth understanding before you commit synthesis resources to a list derived from AlphaFold2-based calculations.

This post documents what we've observed working through the SAbDab holdout set — antibody crystal structures deposited after AlphaFold2's training cutoff, which we use as an uncontaminated test set for evaluating prediction accuracy.

The CDR Loop Prediction Problem: Structural Biology Background

CDR loops, particularly CDR H3, are structurally distinct from the rest of the antibody Fv domain. The framework regions (the VH and VL beta-sandwich scaffolds) are highly conserved across antibody sequences and are well-represented in AlphaFold2's training data through thousands of PDB-deposited crystal structures. Framework prediction by AlphaFold2 is excellent: backbone RMSDs of 0.5–0.8 Å relative to crystal structures are typical, consistent with general-domain performance on the CASP14 benchmark.

CDR loops are different for three reasons. First, they can be conformationally flexible in solution, and some adopt their bound conformation only on contact with antigen; the unbound conformation that gets crystallized and deposited in the PDB may not reflect the binding-competent state. Second, CDR H3 length and sequence diversity are enormous. Unlike the other five CDRs, H3 shows no strong canonical structure preference beyond a few well-characterized "torso" motifs at its base, and loop lengths from 3 to 28+ residues create a space that's thinly sampled even in a large training corpus. Third, inter-CDR contacts (H3 packing against H1 or L3 in some antibody structures) create dependencies that single-sequence folding models handle with varying accuracy.

Measured Performance on the SAbDab Holdout

We evaluated AlphaFold2 in monomer, single-sequence mode (no MSA, to reflect typical usage for novel antibody sequences) against 200 crystal structures from SAbDab deposited between January 2023 and June 2024, after AlphaFold2's training data cutoff. We computed backbone RMSD for each CDR separately, grouped by loop length, and recorded the fraction of predictions with RMSD < 1.5 Å ("near-native") and RMSD > 3.0 Å ("large deviation") for each CDR and length bin.
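The per-CDR metrics described above can be sketched in a few lines. This is a minimal illustration, not our benchmarking code: the function names are hypothetical, and it assumes the predicted and crystal coordinates have already been placed in a common frame (for example by superposing on the framework), which is a step this sketch does not perform.

```python
import math
from statistics import median

NEAR_NATIVE = 1.5      # Å, "near-native" threshold used in this post
LARGE_DEVIATION = 3.0  # Å, "large deviation" threshold used in this post


def backbone_rmsd(pred, ref):
    """RMSD between two equal-length lists of (x, y, z) backbone atoms.

    Assumption: both coordinate sets are already in the same frame;
    no superposition is done here.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and equal length")
    sq = sum((p[i] - r[i]) ** 2 for p, r in zip(pred, ref) for i in range(3))
    return math.sqrt(sq / len(pred))


def summarise(rmsds):
    """Median RMSD plus near-native and large-deviation fractions for one bin."""
    if not rmsds:
        raise ValueError("need at least one RMSD value")
    n = len(rmsds)
    return {
        "median_rmsd": median(rmsds),
        "near_native": sum(r < NEAR_NATIVE for r in rmsds) / n,
        "large_deviation": sum(r > LARGE_DEVIATION for r in rmsds) / n,
    }
```

Calling `summarise` once per CDR and length bin yields one row of a summary table like the one that follows.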

Summary of findings:

CDR	Median backbone RMSD (all lengths)	Fraction near-native (< 1.5 Å)	Fraction large deviation (> 3.0 Å)
CDR H1	0.9 Å	71%	8%
CDR H2	0.8 Å	74%	6%
CDR H3 (≤ 10 res.)	1.4 Å	54%	16%
CDR H3 (11–14 res.)	2.3 Å	31%	34%
CDR H3 (≥ 15 res.)	3.7 Å	11%	57%
CDR L1	0.9 Å	70%	7%
CDR L2	0.6 Å	83%	3%
CDR L3	1.1 Å	62%	12%

The degradation with H3 loop length is steep and non-linear. CDR H3 loops of 11–14 residues show large-deviation predictions at 34%, which is high enough that you can't treat the AlphaFold2 model as reliable for this CDR without additional validation. For loops ≥ 15 residues, the majority of predictions have backbone RMSD > 3.0 Å — meaning the loop position in the model is substantially wrong relative to the crystal structure.

This pattern is not unique to AlphaFold2. Purpose-built antibody structure prediction tools like ABodyBuilder2 and IgFold perform better on CDR H3, particularly for loops 11–14 residues, at roughly 22–28% large-deviation rates in that length bin. The advantage of specialized tools over AlphaFold2 for antibody CDR loop prediction is real and measurable, which is why we run both and use the ensemble when possible.

What pLDDT Scores Tell You (and Don't Tell You)

AlphaFold2 reports per-residue predicted local distance difference test (pLDDT) scores, which are widely used as a confidence metric. The connection between pLDDT and actual structural accuracy for CDR loops is real but imprecise. High pLDDT (≥ 80) is best treated as a necessary but not sufficient condition for trusting a loop prediction, and scores just under that threshold offer little protection: we've seen CDR H3 loops with per-residue pLDDT of 75–80 and RMSD > 3.0 Å from the crystal structure. The pLDDT score is more reliable as an indicator of structural disorder than as an indicator of incorrect folding: very low pLDDT (≤ 50) reliably indicates a region that's disordered or unresolved, but moderate-to-high pLDDT doesn't guarantee accuracy.
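For readers who want to check H3 confidence directly: AlphaFold2 writes per-residue pLDDT into the B-factor column of its PDB output, so the scores can be read with a small parser. The sketch below is illustrative only. The function names are hypothetical, and the chain ID and residue range you pass in are assumptions about your own file (AlphaFold2 output is numbered sequentially by default, so an antibody numbering scheme has to be applied first if you want to select H3 by, say, Kabat positions).

```python
def ca_plddt(pdb_text, chain, start, end):
    """Return {residue_label: pLDDT} for CA atoms of `chain` whose residue
    number lies in [start, end]. Insertion codes are kept in the label."""
    scores = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        # Fixed PDB columns: atom name 13-16, chain 22, resSeq 23-26,
        # insertion code 27, B-factor (pLDDT in AlphaFold2 output) 61-66.
        if line[12:16].strip() != "CA" or line[21] != chain:
            continue
        resseq = int(line[22:26])
        if start <= resseq <= end:
            label = f"{resseq}{line[26].strip()}"
            scores[label] = float(line[60:66])
    return scores


def flag_h3(scores, high=80.0, very_low=50.0):
    """Summarise loop confidence using the thresholds quoted in the text."""
    if not scores:
        raise ValueError("no residues selected")
    values = list(scores.values())
    return {
        "min_plddt": min(values),
        "all_high": all(v >= high for v in values),
        "any_very_low": any(v <= very_low for v in values),
    }
```

A result with `all_high` set to true is a reason to look more closely at the loop, not evidence that it is correct.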

Our internal guideline: for CDR H3 loops with all-residue pLDDT ≥ 80 and length ≤ 10, treat the AlphaFold2 model as a reasonable basis for ΔΔG calculations. For H3 loops with any residue pLDDT < 70, or length 11+, run Rosetta loop modeling (500 decoys, keeping the decoy with the lowest centroid score) before ΔΔG calculations. For H3 length ≥ 15, flag for review regardless of pLDDT.

Failure Mode: The Loop Stuck in Extended Conformation

The most consistent failure mode we observe in AlphaFold2 CDR H3 predictions for long loops (≥ 15 residues) is what we call the "extended conformation trap." The model predicts the loop in a partially extended, relatively straight conformation that sits away from the antibody body, rather than the compacted, curled or hairpin-like conformations common in crystallized antibodies. In crystal structures, long H3 loops are frequently stabilized by inter-residue hydrogen bonds and van der Waals packing with adjacent CDRs; without antigen context, AlphaFold2 appears to default to an extended state that lacks these stabilizing interactions.

This matters for ΔΔG calculations because the position of the loop tip relative to the antigen surface determines which residues are predicted to contact the epitope. A 3–4 Å shift in the loop tip position can change the identity of predicted contact residues entirely, so the ΔΔG scan ends up evaluating mutations at the wrong positions.
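A toy calculation makes the sensitivity concrete. The sketch below defines a contact as any loop atom within a distance cutoff of any antigen atom; the 4.5 Å cutoff, the function names, and the coordinates are all illustrative assumptions, not values from our pipeline.

```python
import math


def contact_residues(loop_atoms, antigen_atoms, cutoff=4.5):
    """Loop residue labels with at least one atom within `cutoff` Å of the antigen.

    loop_atoms: {residue_label: [(x, y, z), ...]}
    antigen_atoms: [(x, y, z), ...]
    The 4.5 Å default is an assumed, commonly used heavy-atom cutoff.
    """
    return {
        label
        for label, atoms in loop_atoms.items()
        if any(math.dist(a, b) <= cutoff for a in atoms for b in antigen_atoms)
    }


def shift(loop_atoms, vector):
    """Rigidly translate every loop atom by `vector` (dx, dy, dz)."""
    dx, dy, dz = vector
    return {
        label: [(x + dx, y + dy, z + dz) for x, y, z in atoms]
        for label, atoms in loop_atoms.items()
    }


# Toy geometry: one antigen atom, two loop residues 3.5 Å apart.
antigen = [(0.0, 0.0, 0.0)]
loop = {"99": [(0.0, 4.0, 0.0)], "100": [(3.5, 4.0, 0.0)]}

print(sorted(contact_residues(loop, antigen)))                            # ['99']
print(sorted(contact_residues(shift(loop, (-3.5, 0.0, 0.0)), antigen)))   # ['100']
```

In this toy case a 3.5 Å translation of the loop swaps the predicted contact from one residue to its neighbor, which is the same kind of shift that redirects a mutation scan.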

We ran a specific validation case in late 2024: an antibody with a 17-residue CDR H3 and SPR-confirmed binding (K_D = 12 nM) to a growth factor receptor domain (antigen PDB structure available). AlphaFold2 predicted H3 in an extended conformation, with an RMSD of 4.2 Å from a subsequently obtained crystal structure. The ΔΔG scan from the AlphaFold2 model suggested positions 100b–100d as the most mutable (low cost to substitute). The crystal structure, once available, showed positions 99–100a as the primary epitope contacts. A synthesis list built from the AlphaFold2-only calculation would have emphasized the wrong positions.

We're not saying this makes AlphaFold2 unusable for long-H3 programs. We're saying it makes the loop-refinement step non-optional, and the ΔΔG results from unrefined AlphaFold2 models for long-H3 loops should be treated as hypothesis generation, not synthesis-ready rankings.

Practical Routing Based on H3 Length

Based on our holdout benchmarking, we've settled on a three-tier routing for structural input to ΔΔG calculations:

Tier 1 — AlphaFold2 sufficient (H3 ≤ 10 residues, pLDDT ≥ 80): Proceed directly from AlphaFold2 Fv model to Rosetta ΔΔG scanning. Report confidence as "medium-high" for H3 mutations, "high" for H1, H2, L1–L3.

Tier 2 — Loop refinement required (H3 11–14 residues, or any H3 with pLDDT 60–80): Run ABodyBuilder2 alongside AlphaFold2, generate 500-decoy loop ensemble using Rosetta, select 5 lowest-energy conformers, report ensemble-median ΔΔG. Confidence "medium" for H3 mutations.

Tier 3 — Experimental anchor recommended (H3 ≥ 15 residues, or pLDDT < 60 in H3): Computational ΔΔG calculations are qualitative guidance only. Explicitly recommend SPR cross-validation for a 15–20 variant pilot set before committing to a larger synthesis list. If crystallography or cryo-EM data is available or feasible, prioritize it as the structural input.
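The three tiers above can be written as a small routing function. This is a sketch under two assumptions that the tier definitions leave open: the H3 pLDDT is summarised as the minimum per-residue value across the loop, and when criteria overlap (for example an 11–14 residue loop with pLDDT below 60) the more conservative tier wins. The function name is hypothetical.

```python
TIER_LABELS = {
    1: "AlphaFold2 sufficient",
    2: "Loop refinement required",
    3: "Experimental anchor recommended",
}


def route_h3(h3_length, min_plddt):
    """Return the tier (1, 2 or 3) for a given H3 length and minimum H3 pLDDT.

    Assumptions: `min_plddt` is the lowest per-residue pLDDT in H3, and the
    most conservative matching tier takes precedence.
    """
    if h3_length < 1:
        raise ValueError("h3_length must be positive")
    if h3_length >= 15 or min_plddt < 60:
        return 3
    if h3_length >= 11 or min_plddt < 80:
        return 2
    return 1


for length, plddt in [(9, 85.0), (9, 75.0), (12, 90.0), (17, 95.0)]:
    tier = route_h3(length, plddt)
    print(length, plddt, tier, TIER_LABELS[tier])
```

Checking Tier 3 first keeps the function safe by default: a long loop is never downgraded to Tier 1 or 2 just because its pLDDT is high.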

This routing doesn't eliminate AlphaFold2 from long-H3 programs — it calibrates how much experimental validation is needed before acting on computational predictions. For programs already at a stage where structural data is justified by the investment, the computational predictions provide a way to prioritize what to measure first.
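The Tier 2 reporting step (keep the five lowest-energy conformers, then report the ensemble-median ΔΔG) reduces to a sort and a median. The sketch below assumes each decoy has already been scored and scanned elsewhere; the dictionary layout, the mutation label format, and the function name are hypothetical and do not correspond to any Rosetta output format.

```python
from statistics import median


def ensemble_median_ddg(decoys, n_keep=5):
    """Median ΔΔG per mutation over the `n_keep` lowest-energy decoys.

    decoys: list of {"energy": float, "ddg": {mutation_label: float}}
    Only mutations scored in every kept decoy are reported.
    """
    if not decoys:
        raise ValueError("need at least one decoy")
    kept = sorted(decoys, key=lambda d: d["energy"])[:n_keep]
    shared = set(kept[0]["ddg"])
    for d in kept[1:]:
        shared &= set(d["ddg"])
    return {m: median([d["ddg"][m] for d in kept]) for m in sorted(shared)}
```

Using the median rather than the mean keeps a single badly placed conformer from dominating the reported value.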

Comparison with Antibody-Specific Prediction Tools

ABodyBuilder2 and IgFold both outperform AlphaFold2 on CDR H3 for loops ≥ 11 residues in our holdout benchmarking. The performance difference is most pronounced at 11–14 residues: ABodyBuilder2 achieves an approximately 45% near-native (< 1.5 Å) rate in this length bin on our test set, compared to 31% for AlphaFold2. IgFold is similar to ABodyBuilder2, with some variation depending on sequence similarity to training data.

For framework regions and CDRs H1, H2, L1–L3, the performance difference between AlphaFold2 and antibody-specific tools is small — AlphaFold2 is competitive. The advantage of antibody-specific tools is concentrated in exactly the region that matters most for affinity calculations: the CDR H3 loop and its immediate surroundings.

The practical recommendation is not to pick one tool and use it exclusively, but to run multiple predictions and use disagreement between models as an uncertainty signal. When AlphaFold2 and ABodyBuilder2 predict similar H3 conformations (RMSD between predictions < 1.0 Å), that concordance suggests the prediction is more reliable. When they disagree substantially (RMSD > 2.0 Å), that's a flag that the conformational space is genuinely uncertain and the ΔΔG results from either model should be held loosely until experimental anchor data is available.
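The agreement check can be sketched as follows. The 1.0 Å and 2.0 Å thresholds come from the paragraph above; the "intermediate" label for values in between, the function names, and the assumption that both H3 coordinate sets are already superposed in a common frame are ours for illustration.

```python
import math


def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    Assumes both models are already in the same frame; no fitting is done.
    """
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def model_agreement(h3_model_a, h3_model_b, concordant=1.0, discordant=2.0):
    """Classify agreement between two H3 predictions by their mutual RMSD."""
    d = rmsd(h3_model_a, h3_model_b)
    if d < concordant:
        return "concordant"
    if d > discordant:
        return "discordant"
    return "intermediate"  # assumed label; the text only defines the two extremes
```

A "discordant" result is the cue to hold ΔΔG values from either model loosely until experimental data is available.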