Affinity maturation by experimental directed evolution is the default path for most antibody programs. Run a few rounds of error-prone PCR or site-saturation mutagenesis, pan against your target, sequence the survivors. The logic is sound and the data is real. But the throughput ceiling is real too: you can screen perhaps 10⁷–10⁹ variants across a full phage display campaign, and most of those variants are noise relative to the narrow set of CDR positions that actually contact the epitope.
We've been working through this problem computationally at Genolux since we started, and the core insight isn't complicated: Rosetta ΔΔG scanning works well when the starting structure is good, the CDR you're mutating is H1, H2, L1, or L2 (the shorter, more constrained loops), and the question is single-point substitution rather than loop insertions. It works less well for long CDR H3 loops, multi-position combinatorial mutations, and cases where the antibody is operating near the edge of its solubility range. Being specific about those limits matters more than overselling the wins.
The Energy Landscape of CDR H3
CDR H3 is the primary driver of paratope-epitope specificity in most antibody programs. It's also the loop that gives computational methods the most trouble. H3 loops range from 3 residues to over 28 (Kabat numbering), with therapeutic antibodies clustering between 10 and 16 residues. Beyond about 12 residues, the conformational space available to that loop during docking is enormous, and the Rosetta energy function has limited ability to distinguish near-native from decoy states without an experimental starting structure to anchor the sampling.
What this means practically: if your discovery campaign produced a hit with an H3 length of 8–10 residues and you have a decent homology model (sequence identity to template ≥ 70%), ΔΔG scanning of the full H3 is tractable and usually informative. If H3 is 16+ residues and your starting model is built from a distant template, the error bars on individual position ΔΔG values are ±1.0–1.5 REU, which can wash out real signal. In that regime, we shift focus to H1, H2, and the light chain CDRs where the conformational space is tighter and the energy function behaves more predictably.
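One way to make the error-bar argument concrete is to discard any predicted ΔΔG whose magnitude is smaller than the uncertainty assumed for that model. The sketch below is illustrative only: the function name, the mutation labels, the ΔΔG values, and the choice of 1.5 REU as the cutoff for a long H3 on a distant template are all assumptions made for the example, not Rosetta output or a Rosetta API.

```python
def filter_by_uncertainty(ddg_by_mutation, uncertainty_reu):
    """Keep only predictions whose magnitude exceeds the assumed error bar."""
    return {
        mutation: ddg
        for mutation, ddg in ddg_by_mutation.items()
        if abs(ddg) > uncertainty_reu
    }


# Hypothetical predictions in REU; negative means predicted to improve binding.
predictions = {"Y100aW": -1.8, "S99T": -0.6, "G101A": 0.9, "D102E": 2.3}

# Assumed worst-case error bar for a 16+ residue H3 built from a distant template.
confident = filter_by_uncertainty(predictions, uncertainty_reu=1.5)
print(confident)
```

With a ±1.5 REU error bar, only the two largest-magnitude predictions survive, which is the "washed out signal" problem in miniature.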
When Rosetta ΔΔG Scanning Outperforms Experimental Evolution
The comparison with directed evolution isn't about accuracy in isolation — it's about the return on information per experiment. Experimental affinity maturation produces real binding data, but the readout is typically a rank-ordering of full-clone sequences after panning, not a per-residue ΔΔG map. You learn which variants survived selection, not why.
Computational scanning gives you the opposite: per-residue energetics without any panning. A full single-point scan of CDR H3 (12-residue loop × 20 amino acids = 240 residue identities, wild type included) runs in under two hours on a modern workstation using Rosetta's ddg_monomer protocol applied to a fixed backbone. That scan tells you which positions tolerate mutation (low ΔΔG for most substitutions) and which are structurally constrained (high ΔΔG for most substitutions). Phage display doesn't produce that residue-level annotation, and it meaningfully improves library design for the experimental round that follows.
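The bookkeeping around such a scan can be sketched in a few lines. This is a minimal sketch, not the Rosetta protocol itself: the loop sequence, the position numbering, the 1.0 REU cutoff, and both function names are hypothetical. Note that excluding the wild-type identity at each position leaves 12 × 19 = 228 actual substitutions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_point_mutations(loop_sequence, start_index=1):
    """List every non-wild-type single substitution as (wild_type, position, mutant)."""
    mutations = []
    for offset, wild_type in enumerate(loop_sequence):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                mutations.append((wild_type, start_index + offset, mutant))
    return mutations


def classify_positions(ddg_by_mutation, tolerant_cutoff=1.0):
    """Label a position 'tolerant' if its mean ddG is below the cutoff, else 'constrained'."""
    per_position = {}
    for (_, position, _), ddg in ddg_by_mutation.items():
        per_position.setdefault(position, []).append(ddg)
    return {
        position: "tolerant" if sum(values) / len(values) < tolerant_cutoff else "constrained"
        for position, values in sorted(per_position.items())
    }


# Hypothetical 12-residue H3 sequence, used only to count the scan size.
scan = enumerate_point_mutations("ARDYYGSSYFDY", start_index=95)
print(len(scan))

# Hypothetical ddG values (REU) for two positions.
example_ddg = {
    ("Y", 100, "W"): 0.2,
    ("Y", 100, "A"): 0.6,
    ("G", 101, "A"): 3.5,
    ("G", 101, "P"): 5.0,
}
print(classify_positions(example_ddg))
```

Averaging per position is one simple summary; a median or a count of substitutions under the cutoff would serve equally well.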
A concrete scenario: a small biotech in the Baltimore area came to us in mid-2024 with a 12-residue CDR H3 antibody against a cytokine target, KD around 8 nM from SPR, wanting to push below 1 nM. We ran ΔΔG scanning on all six CDRs from a homology model (92% identity to PDB template), identified three H3 positions and one H2 position where Trp, Tyr, or Phe substitutions were energetically favorable, and generated a focused 48-variant synthesis list. Experimental confirmation showed that 6 of the 48 synthesized variants reached KD ≤ 1.5 nM, with the best at 0.4 nM. A full site-saturation mutagenesis library covering just CDR H3 at 12 positions would have required synthesizing 240 variants to get equivalent coverage of that residue set, at higher cost and with 3–4 additional weeks of experimental time.
We're not saying computational scanning replaces experimental affinity maturation. The 6 confirmed improvers out of 48 meant 42 misses: a hit rate of 12.5%, which is useful but not magical. What changed was that the synthesis list went from 240 unprioritized candidates to 48 energetically prioritized ones.
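The arithmetic behind those numbers is short enough to check directly. The variable names below are illustrative; the counts are the ones quoted in the scenario.

```python
def hit_rate(hits, tested):
    """Fraction of tested variants that were confirmed as improvers."""
    return hits / tested


confirmed, synthesized, full_library = 6, 48, 240

print(f"hit rate: {hit_rate(confirmed, synthesized):.1%}")
print(f"misses: {synthesized - confirmed}")
print(f"synthesis reduction: {full_library / synthesized:.0f}x")
```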
Where the Protocol Breaks Down
Multi-position combinatorial mutations are the first failure mode. The Rosetta ΔΔG protocol assumes a fixed backbone and scores each substitution independently. When you have two or three simultaneous mutations, the coupling between side chains matters — a Trp at position 100 might be stabilizing only in the context of a Tyr at position 97. The pairwise interaction terms are captured in the REF2015 energy function, but in practice, iterative single-mutation scanning followed by combinatorial synthesis of the top candidates works better than trying to score all pairwise combinations computationally. The computational cost of exhaustive pairwise scanning is also prohibitive beyond about 3–4 positions.
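The scaling problem is easy to see by counting. The sketch below assumes 19 non-wild-type substitutions per position and a 12-residue loop, and counts variants that mutate exactly one, two, three, or four positions at once; the function name is hypothetical.

```python
from math import comb


def combinatorial_variants(n_positions, order, substitutions_per_position=19):
    """Count variants that mutate exactly `order` of `n_positions` simultaneously."""
    return comb(n_positions, order) * substitutions_per_position ** order


for order in (1, 2, 3, 4):
    print(order, combinatorial_variants(12, order))
```

Single-point scanning stays in the hundreds, while exhaustive double mutants already reach tens of thousands and triples exceed a million, which is why scanning single mutations first and combining only the top candidates is the more practical route.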
The second failure mode is developability trade-offs. A mutation that increases binding affinity may simultaneously introduce a solubility liability, increase aggregation propensity as measured by the spatial aggregation propensity (SAP) score, or create a post-translational modification motif. We've seen cases where a ΔΔG-favorable Trp substitution in CDR H3 drove the aggregation score above the threshold we use internally (SAP ≥ 0.5 per residue in the paratope region). Affinity optimization and developability optimization are two coupled problems, and running them independently produces suboptimal candidates. That's the integration problem we built Genolux's pipeline around: computing both ΔΔG and developability metrics from the same structural model, then ranking by a composite score.
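A composite ranking of this kind might look like the sketch below. The article does not specify the actual scoring formula, so the linear form, the `sap_weight` value, the class and function names, and the candidate data are all assumptions for illustration; only the SAP threshold of 0.5 comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    ddg_reu: float       # predicted ddG in REU; negative is favorable
    paratope_sap: float  # SAP score per residue in the paratope region


SAP_THRESHOLD = 0.5  # threshold quoted in the text


def composite_score(candidate, sap_weight=2.0):
    """Lower is better: ddG plus an assumed linear penalty on aggregation propensity."""
    return candidate.ddg_reu + sap_weight * candidate.paratope_sap


def rank_candidates(candidates, sap_weight=2.0):
    """Drop candidates at or above the SAP threshold, then sort by composite score."""
    passing = [c for c in candidates if c.paratope_sap < SAP_THRESHOLD]
    return sorted(passing, key=lambda c: composite_score(c, sap_weight))


# Hypothetical candidates: A has the best ddG but fails the SAP filter.
candidates = [
    Candidate("A", ddg_reu=-1.8, paratope_sap=0.62),
    Candidate("B", ddg_reu=-1.2, paratope_sap=0.30),
    Candidate("C", ddg_reu=-0.9, paratope_sap=0.10),
]
ranked = rank_candidates(candidates)
print([c.name for c in ranked])
```

The point of the example is that the variant with the most favorable ΔΔG is not necessarily the one that ranks first once aggregation propensity is scored from the same model.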
Loop Length and the Mutation Feasibility Map
One useful heuristic for deciding where to focus CDR scanning: treat CDR loop length as a rough proxy for computational tractability. Loops up to 8 residues (most CDR L1, L2, H1, H2; short H3) have well-characterized backbone geometries and respond predictably to Rosetta energy minimization. Loops 9–13 residues require careful backbone sampling but remain tractable with Rosetta's loop_modeling followed by ΔΔG scanning. Loops 14+ residues require ensemble sampling — modeling 5–10 low-energy conformers and averaging ΔΔG values across them — which adds computational overhead and increases variance. At 18+ residues, we treat computational ΔΔG as hypothesis generation only, not as a substitute for experimental screening.
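The heuristic above can be written down as a lookup plus an ensemble average. This is a sketch under stated assumptions: the tier labels and function names are hypothetical, the length boundaries follow the paragraph above, and the per-conformer ΔΔG values are made up for illustration.

```python
from statistics import mean, stdev


def tractability_tier(loop_length):
    """Map CDR loop length to the scanning strategy described in the text."""
    if loop_length <= 8:
        return "direct scan"
    if loop_length <= 13:
        return "loop modeling, then scan"
    if loop_length < 18:
        return "ensemble-averaged scan"
    return "hypothesis generation only"


def ensemble_ddg(ddg_per_conformer):
    """Average one mutation's ddG across conformers and report the spread."""
    return mean(ddg_per_conformer), stdev(ddg_per_conformer)


for length in (6, 12, 15, 20):
    print(length, tractability_tier(length))

# Hypothetical ddG values (REU) for a single mutation across five conformers.
average, spread = ensemble_ddg([-1.2, -0.4, -0.9, 0.1, -0.6])
print(round(average, 2), round(spread, 2))
```

Reporting the spread alongside the mean makes the added variance visible: a mutation whose ΔΔG changes sign across conformers deserves less trust than its average suggests.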
This isn't a limitation unique to Rosetta — it applies to essentially all physics-based energy functions. The conformational entropy of a 20-residue loop is large enough that sampling errors in the starting model dominate the ΔΔG signal.
Integration with the Experimental Campaign
The most effective workflow we've found is a hybrid campaign structure: one computational scan round early, followed by a focused experimental round, followed by a second computational pass on confirmed binders. The first scan generates the synthesis list. The experimental round provides ground-truth data on which computational predictions were accurate. The second pass uses the confirmed binders as anchors for a tighter scan around neighboring positions.
This loop — predict, synthesize, confirm, re-predict with better anchors — is substantially more efficient than either pure experimental screening or pure computational scoring. The experimental data corrects the systematic biases of the energy function (which tends to favor aromatic substitutions), and the computational pass prevents you from doing exhaustive experimental mutagenesis around positions that are genuinely constrained.
The stopping criterion matters too. Affinity maturation campaigns often continue past the point of diminishing returns because there's always one more variant to synthesize. Our internal guideline: when three consecutive scan-and-confirm cycles fail to identify a variant with ΔΔG < −0.5 REU in the fully relaxed model, the residual gain is likely to be within experimental error of the measurement. At that point, the engineering problem shifts from affinity to developability: whether the lead candidate can survive the formulation and manufacturing process.
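That guideline translates into a simple check over the best ΔΔG found in each cycle. The function name, parameter names, and example histories below are hypothetical; the −0.5 REU threshold and the three-cycle window come from the guideline above.

```python
def should_stop(best_ddg_per_cycle, threshold_reu=-0.5, patience=3):
    """Return True when the last `patience` cycles each failed to beat the threshold."""
    if len(best_ddg_per_cycle) < patience:
        return False
    recent = best_ddg_per_cycle[-patience:]
    # A cycle "fails" when its best variant is not below the threshold.
    return all(ddg >= threshold_reu for ddg in recent)


# Hypothetical campaign histories: best ddG (REU) found in each cycle.
print(should_stop([-1.4, -0.8, -0.3, -0.2, -0.4]))
print(should_stop([-1.4, -0.3, -0.2, -0.6]))
```

The first history ends with three cycles that never reach −0.5 REU, so the check says stop; in the second, the latest cycle still clears the threshold, so the campaign continues.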
The question of when to stop mutating is ultimately a portfolio question. If you have a 2 nM lead that passes developability filters, the question isn't whether you can get to 0.5 nM — you probably can — but whether the additional synthesis cycles and development risk of a new variant are justified given the clinical target product profile. That decision is yours, not ours. What we can tell you is whether the energy landscape suggests further gains are plausible, and which positions are the most likely levers.