Benchmarking Antibody Structure Prediction Models on SAbDab: 2025 Update

Antibody structure prediction has moved fast in the past three years. When we started building Genolux's computational platform, the choices were limited and the performance gaps between methods were large. That's no longer the case — several methods now achieve sub-angstrom accuracy on structured framework regions, and the competitive differentiation has shifted almost entirely to CDR loop performance, particularly CDR-H3.

We decided to do a systematic, reproducible comparison on a held-out SAbDab set rather than rely on each method's published self-benchmarks. Self-benchmarks are optimistic: methods report results on data partitions that may not be genuinely held-out, use favorable resolution cutoffs, or cherry-pick metrics that reflect well on their approach. We wanted to know how these methods actually perform for our use case — which is providing structures for downstream docking and ΔΔG calculation, not just answering "is the fold right?"

The code and dataset we used for this benchmark are available on our GitHub (genolux-research/sabdab-benchmark-2025), including the structure filtering scripts, RMSD calculation routines, and the full results table. We wanted this to be reproducible so that teams can extend it with new methods or updated versions of existing ones.

Dataset and Filtering

We built a holdout set from SAbDab structures deposited between January 2023 and September 2024, after the training cutoff dates we could establish for the major methods. Inclusion criteria: resolution ≤ 2.2 Å; paired Fv (VH + VL) present; no missing residues in CDR-H3 or CDR-L3; no crystal contacts within 5 Å of any CDR loop. Structures were grouped by sequence identity clustering: any two structures with >95% VH sequence identity were considered redundant, and one was retained at random. Final holdout: 200 structures.
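As a rough illustration of the inclusion criteria above, here is a minimal sketch of the filter as a single predicate. It is not the filtering script from the repository: the `Entry` record and its field names are hypothetical, the date window is assumed to run from 1 January 2023 to 30 September 2024, and "no crystal contacts within 5 Å" is read as a closest-contact distance strictly greater than 5 Å.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    """Hypothetical per-structure metadata record (field names are assumptions)."""
    pdb_id: str
    deposited: date
    resolution: float        # angstroms
    has_paired_fv: bool      # both VH and VL present
    missing_h3_or_l3: bool   # any unresolved residue in CDR-H3 or CDR-L3
    min_cdr_contact: float   # closest crystal contact to any CDR loop, angstroms


WINDOW_START = date(2023, 1, 1)   # assumed exact start of the deposition window
WINDOW_END = date(2024, 9, 30)    # assumed exact end of the deposition window


def passes_filter(e: Entry) -> bool:
    """Apply the inclusion criteria described in the text."""
    return (
        WINDOW_START <= e.deposited <= WINDOW_END
        and e.resolution <= 2.2
        and e.has_paired_fv
        and not e.missing_h3_or_l3
        and e.min_cdr_contact > 5.0
    )
```

Keeping the criteria in one predicate makes it easy to report how many structures each criterion removes when the holdout is rebuilt.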

This clustering step matters more than it might seem. SAbDab contains redundancy at the sequence level: multiple crystal forms of the same antibody, and close homologs from related programs. If you don't cluster before evaluation, each over-represented antibody is counted many times, so your RMSD statistics reflect how many near-identical structures are in your set rather than how well a method generalizes, and methods with more recent training data benefit disproportionately from having memorized high-identity sequences.
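The redundancy step can be sketched as a greedy pass over the VH sequences in random order. This is an illustration under stated assumptions, not the clustering code from the repository: sequences are assumed to be pre-aligned to a common length (for example by IMGT position, with "-" for gaps), identity is computed position by position, and the function names are hypothetical. A greedy pass is also only an approximation of a full clustering, since identity above the threshold is not transitive.

```python
import random


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned sequences.

    Positions where both sequences have a gap are ignored; a gap against
    a residue counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def pick_representatives(vh_by_id, threshold=0.95, seed=0):
    """Keep one structure per group of VH sequences above the identity threshold.

    Visiting the IDs in shuffled order means the retained member of each
    redundant group is effectively chosen at random.
    """
    ids = sorted(vh_by_id)
    random.Random(seed).shuffle(ids)
    kept = []
    for sid in ids:
        if all(identity(vh_by_id[sid], vh_by_id[k]) <= threshold for k in kept):
            kept.append(sid)
    return kept
```

Fixing the seed keeps the retained set stable between runs, which matters if other teams are meant to reproduce the same holdout.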

For each of the 200 structures, we predicted the Fv structure with each of the five methods tested: ABodyBuilder2 (v2.3), IgFold (v1.0.0), ESMFold (v1), AlphaFold2-Multimer (v2.3, using the paired VH/VL sequence), and our internal model (GnlxFv-1.2). All methods received only the VH and VL sequences as input: no antigen sequence, no template structure, and no information about the crystal environment. For methods that report confidence scores (pLDDT or equivalent), we recorded those alongside the coordinates.

Overall Framework and CDR Accuracy

Framework accuracy — meaning the backbone RMSD on framework residues (FR1–FR4 for both chains, excluding CDR loops by IMGT definition) — is excellent for all methods. Median framework RMSD across the 200-structure set: ABodyBuilder2 0.48 Å, IgFold 0.51 Å, ESMFold 0.54 Å, AlphaFold2-Multimer 0.44 Å, GnlxFv-1.2 0.47 Å. At these accuracy levels, framework RMSD is not a meaningful discriminator — all five methods get the scaffold right.

CDR-H3 is where the methods diverge. Here are median CDR-H3 RMSDs (backbone atoms, all lengths pooled): ABodyBuilder2 2.1 Å, IgFold 2.6 Å, ESMFold 3.4 Å, AlphaFold2-Multimer 2.4 Å, GnlxFv-1.2 1.8 Å. There's a real spread here, and it matters for applications like ΔΔG calculation where CDR-H3 position errors propagate into interface score errors.

For CDR-L3 and the other CDR loops (H1, H2, L1, L2), differences between methods are smaller — most fall in the 0.6–1.1 Å median range. CDR-L3 accuracy (median 0.7–0.9 Å across methods) is notably better than CDR-H3, consistent with the structural rigidity of CDR-L3 at length 9 that we discussed in our loop length diversity post.
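For readers who want to see how framework and loop RMSDs of this kind are typically computed, here is a minimal NumPy sketch using Kabsch superposition. It is not the RMSD routine from the repository. It assumes that predicted and reference backbone atoms are already matched one-to-one in the same order, and that the loop RMSD is measured after superposing on the framework atoms; that second choice is a common convention and an assumption here, not something stated above. Function and argument names are hypothetical.

```python
import numpy as np


def kabsch(mobile, target):
    """Rotation R and translation t that best map mobile onto target.

    Both inputs are (N, 3) arrays of matched atoms. Apply the result to
    any (M, 3) coordinate array as `coords @ R.T + t`.
    """
    mob_c = mobile.mean(axis=0)
    tgt_c = target.mean(axis=0)
    H = (mobile - mob_c).T @ (target - tgt_c)      # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ mob_c
    return R, t


def rmsd(a, b):
    """Root-mean-square deviation between two matched (N, 3) arrays."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


def framework_and_loop_rmsd(pred, ref, fw_idx, loop_idx):
    """Superpose on framework atoms, then report framework and loop RMSD."""
    R, t = kabsch(pred[fw_idx], ref[fw_idx])
    fitted = pred @ R.T + t
    return rmsd(fitted[fw_idx], ref[fw_idx]), rmsd(fitted[loop_idx], ref[loop_idx])
```

Measuring the loop in the framework frame penalizes a loop that has the right shape but points in the wrong direction, which is the kind of error that matters for interface scoring.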

CDR-H3 Length Stratification: Where Models Break Down

Pooled CDR-H3 statistics obscure the most important pattern. When we stratify by CDR-H3 length, the behavior is markedly different across the length range, and all methods degrade at longer loops — but not at the same rate or threshold.

For CDR-H3 lengths 8–12 (IMGT): all five methods perform well, with median RMSD < 1.5 Å. This is the sweet spot of the training data, and all models have learned good representations of these loops.

For CDR-H3 lengths 13–15: ABodyBuilder2 and GnlxFv maintain median RMSD below 2.0 Å. IgFold and AlphaFold2-Multimer start diverging, reaching 2.3–2.8 Å median. ESMFold drops most sharply here, reaching 3.5 Å median at length 15.

For CDR-H3 lengths 16–18: median RMSDs for all methods are in the 3.0–5.0 Å range. At this point, the top-1 structure from any method should be considered a hypothesis rather than a prediction. Our internal recommendation for loops in this range is to generate an ensemble of 10–20 structures using different methods and cluster them — structural consensus across methods is more informative than any single top-1 prediction.
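One simple way to act on the ensemble recommendation is to find the largest group of mutually similar loop conformations and take its medoid. The sketch below is illustrative only: it assumes a precomputed symmetric matrix of pairwise CDR-H3 RMSDs between ensemble members (all in a common framework frame), and the 1.5 Å neighbor cutoff is an arbitrary placeholder rather than a value from our pipeline.

```python
def largest_cluster_medoid(dist, cutoff=1.5):
    """Return (medoid index, member indices) of the most populated neighborhood.

    dist is a symmetric matrix (list of lists) of pairwise RMSDs in angstroms.
    The cluster is the set of models within `cutoff` of the model with the
    most neighbors; the medoid is the member closest to all other members.
    """
    n = len(dist)
    neighbors = [[j for j in range(n) if dist[i][j] <= cutoff] for i in range(n)]
    center = max(range(n), key=lambda i: len(neighbors[i]))
    members = neighbors[center]
    medoid = min(members, key=lambda i: sum(dist[i][j] for j in members))
    return medoid, members


def consensus_fraction(dist, cutoff=1.5):
    """Share of the ensemble that falls in the largest cluster."""
    _, members = largest_cluster_medoid(dist, cutoff)
    return len(members) / len(dist)
```

A low consensus fraction is a reason to treat every member of the ensemble as a hypothesis; a high one suggests the medoid is a reasonable working structure.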

For CDR-H3 lengths 19 and above: only 11 structures in our holdout fall in this category, so statistics are noisy. All methods show median RMSD > 4.5 Å, and confidence scores from pLDDT-based methods do not reliably distinguish better from worse predictions in this range; they are not calibrated for this regime.
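The stratification above can be sketched in a few lines, with a percentile bootstrap interval on each median to make the small-sample noise in the longest bin visible. This is an illustrative sketch, not the analysis code from the repository; the function names, the number of resamples, and the 95% interval are assumptions.

```python
import random
import statistics

# Length bins used in the text (IMGT CDR-H3 length); None means open-ended.
BINS = [("8-12", 8, 12), ("13-15", 13, 15), ("16-18", 16, 18), ("19+", 19, None)]


def bin_label(length):
    """Map a CDR-H3 length to its bin label, or None if it is shorter than 8."""
    for label, lo, hi in BINS:
        if length >= lo and (hi is None or length <= hi):
            return label
    return None


def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    meds = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo = meds[int((alpha / 2) * n_boot)]
    hi = meds[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def summarize(records):
    """records: iterable of (h3_length, rmsd). Returns {bin: (n, median, (lo, hi))}."""
    by_bin = {}
    for length, value in records:
        label = bin_label(length)
        if label is not None:
            by_bin.setdefault(label, []).append(value)
    return {
        label: (len(v), statistics.median(v), bootstrap_median_ci(v))
        for label, v in by_bin.items()
    }
```

With only 11 structures in the 19+ bin, the interval will typically be wide, which is a more honest summary than a bare median.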

Confidence Score Calibration

A benchmark that reports only RMSD is incomplete for practical use — what matters for pipeline decisions is not just accuracy but calibration: does a high confidence score actually predict lower RMSD? This is where methods differ substantially in ways that don't show up in the average RMSD tables.

ABodyBuilder2's confidence score shows the best calibration in our analysis: within CDR-H3 length bin 13–15, structures in the top confidence quartile have median RMSD 1.6 Å vs. 3.2 Å for bottom quartile. The confidence signal is informative. AlphaFold2-Multimer's pLDDT scores are well-calibrated for framework regions but poorly calibrated for CDR-H3 — high pLDDT CDR-H3 predictions are not systematically more accurate in our dataset.
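The quartile comparison used here is simple to reproduce. The sketch below assumes one confidence value and one CDR-H3 RMSD per structure within a single length bin; the function name is hypothetical, and with a sample size that is not a multiple of four the quartiles are taken as the lowest and highest `n // 4` structures.

```python
import statistics


def quartile_calibration(confidence, rmsd):
    """Median RMSD in the top and bottom confidence quartiles.

    Returns (top_quartile_median, bottom_quartile_median). A well-calibrated
    confidence score gives a clearly lower median in the top quartile.
    """
    if len(confidence) != len(rmsd):
        raise ValueError("confidence and rmsd must have the same length")
    if len(confidence) < 4:
        raise ValueError("need at least 4 structures to form quartiles")
    order = sorted(range(len(confidence)), key=lambda i: confidence[i])
    q = len(order) // 4
    bottom = [rmsd[i] for i in order[:q]]
    top = [rmsd[i] for i in order[-q:]]
    return statistics.median(top), statistics.median(bottom)
```

Comparing within a length bin matters: pooled across lengths, confidence and RMSD both track loop length, which can make a score look better calibrated than it is.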

ESMFold confidence is essentially uncalibrated for CDR loops in our test set — we see no significant correlation between reported confidence and CDR-H3 RMSD. This is consistent with ESMFold being designed primarily for protein fold prediction rather than loop accuracy specifically.

IgFold's confidence (reported as a per-residue pLDDT-equivalent) shows moderate calibration at length 10–14 and breaks down at longer loops, similar to AlphaFold2-Multimer.

GnlxFv-1.2's confidence scores are calibrated by design against this holdout distribution (we trained and validated specifically against SAbDab), which is part of why its CDR-H3 confidence-RMSD correlation (Spearman ρ = −0.62) is the strongest in the comparison. We disclose this explicitly: our internal model was optimized against a partition of SAbDab, and it has home-field advantage on this benchmark.
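For reference, a Spearman correlation of this kind is a Pearson correlation computed on ranks, with tied values sharing their average rank. The standard-library sketch below is illustrative and is not the code that produced the figure above; the function names are hypothetical. A negative ρ is the desired direction here, since higher confidence should go with lower RMSD.

```python
import math


def average_ranks(values):
    """1-based ranks, with ties assigned the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because it works on ranks, the statistic is insensitive to the scale of the confidence score, which makes it usable across methods that report confidence in different units.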

What This Means for Downstream Use

Structure predictions feed downstream calculations. For ΔΔG estimation of CDR mutations (the core of our affinity maturation pipeline), CDR-H3 RMSD translates approximately linearly to ΔΔG error in our calibration experiments: each 1 Å of CDR-H3 RMSD error contributes roughly 0.4–0.6 REU of ΔΔG noise. For CDR-H3 at length 10–14, where prediction errors are 1.0–2.0 Å, this is acceptable — the signal is above the noise. For CDR-H3 at length 16+, prediction error dominates and ΔΔG calculations on single structures are unreliable.
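To make the arithmetic concrete, the rule of thumb above can be written as a one-line conversion. This is a back-of-the-envelope sketch: it assumes the relationship is linear through the origin with a slope of 0.4 to 0.6 REU per Å, which is a simplification of "approximately linearly", and the function name is hypothetical.

```python
def ddg_noise_range(h3_rmsd, low_per_angstrom=0.4, high_per_angstrom=0.6):
    """Approximate range of ddG noise (REU) for a given CDR-H3 RMSD (angstroms)."""
    if h3_rmsd < 0:
        raise ValueError("RMSD cannot be negative")
    return h3_rmsd * low_per_angstrom, h3_rmsd * high_per_angstrom


# Rough ranges for a few representative CDR-H3 errors.
for err in (1.0, 2.0, 4.0):
    low, high = ddg_noise_range(err)
    print(f"{err:.1f} A CDR-H3 RMSD: about {low:.1f} to {high:.1f} REU of noise")
```

Under these assumptions, a 1.0–2.0 Å error corresponds to roughly 0.4–1.2 REU of noise, while a 4 Å error corresponds to roughly 1.6–2.4 REU.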

For docking applications, the VH-VL orientation (measured by Fv packing angle) is often more critical than the absolute accuracy of the CDR loops. Here AlphaFold2-Multimer performs best, with a median VH-VL RMSD of 0.52 Å vs. 0.65 Å for ABodyBuilder2, likely because the multimer model captures VH-VL co-evolutionary signals that antibody-specific models handle through a VH-VL packing module. When docking quality matters most, we use AlphaFold2-Multimer for the initial structure and then refine the CDR loops locally by grafting ABodyBuilder2 loop predictions onto the AlphaFold2-Multimer VH-VL framework.

We're not claiming this hybrid approach is optimal for all use cases. It's what our internal experiments indicated for our specific pipeline, and the benchmark data support the two-method rationale. For labs that can only run one method, ABodyBuilder2 is our recommendation: among the publicly available methods it has the lowest pooled CDR-H3 RMSD (2.1 Å median) and the best-calibrated confidence score. For anyone wanting to stress-test their own structures before a high-stakes application, run multiple methods and compare. Disagreement between methods is the most reliable signal that a prediction is uncertain.

Plans for Ongoing Updates

Structure prediction methods are being updated regularly. The benchmark we've published is a snapshot — SAbDab structures deposited through September 2024, methods at specific version numbers. We plan to update the holdout set annually as new high-resolution structures become available and as methods release new versions.

We'll also extend the benchmark to cover antibody-antigen complex structure prediction (AlphaFold2-Multimer and similar multimer methods) as the holdout set grows. This is a harder problem, and the current benchmark is Fv-only by design: adding antigen geometry introduces additional failure modes that deserve their own analysis rather than being folded into an Fv accuracy table.

The evaluation code handles new method outputs straightforwardly if they follow standard PDB format with IMGT-numbered residues. If you want to add a method or submit a correction to the dataset, contributions to the repository are open.