CDR L3 and H3 Loop Length Diversity in Approved Therapeutic Antibodies

Anya Petrova · 9 min read
[Figure: CDR-L3 and CDR-H3 loop length distribution histograms from the Thera-SAbDab analysis]

CDR loop length is one of the most consequential but least-discussed aspects of antibody library design. When building a naive or semi-synthetic antibody library, decisions about CDR-H3 length distribution determine which regions of antigen space you can access and which you can't. Yet library design choices are often based on convenience (what oligonucleotide synthesis supports) or convention (copying parameters from published libraries) rather than on a principled view of what the approved therapeutic antibody space actually looks like.

We've been building our loop sampling priors from the Thera-SAbDab dataset, which catalogs FDA-approved and late-stage antibody therapeutics with paired VH/VL sequence information. Looking at this dataset from a loop length perspective turns out to be informative in ways that aren't obvious from general antibody repertoire statistics — the therapeutic-approved set shows selection pressures that general repertoire data doesn't capture.

CDR-H3: Wider Range Than Libraries Typically Capture

CDR-H3 length, defined by IMGT (positions 105–117), varies from approximately 3 to 28 residues in the human naive repertoire. Approved therapeutics, however, are not uniformly distributed across that range. In our analysis of 150 approved mAbs from Thera-SAbDab with confirmed sequences (excluding bispecifics with non-standard formats), CDR-H3 length follows a distribution that peaks between 12 and 14 residues (by IMGT count, which excludes the conserved Cys and Trp anchors), with a median of 13 and a mean of 12.8.
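A minimal sketch of how a length summary like this could be computed, assuming the CDR-H3 sequences have already been extracted under IMGT numbering with the anchors excluded. The sequences below are invented placeholders rather than Thera-SAbDab entries, and `summarize_lengths` is a hypothetical helper name.

```python
from collections import Counter
from statistics import mean, median

def summarize_lengths(cdr_h3_seqs):
    """Return (length histogram, median length, mean length) for CDR-H3 strings."""
    lengths = [len(seq) for seq in cdr_h3_seqs]
    return Counter(lengths), median(lengths), mean(lengths)

# Invented placeholder sequences, for illustration only.
toy_h3 = [
    "ARDYYGSSYFDY",
    "ARGGYSSGWYFDV",
    "ARDRGYSYGYAMDY",
    "ARESGYDYGALDY",
    "ARHYDYDGFAY",
]

hist, med, avg = summarize_lengths(toy_h3)
for length in sorted(hist):
    print(f"length {length}: {hist[length]} sequence(s)")
print(f"median={med} mean={avg:.1f}")
```

The same function works on any list of loop strings, so it can be pointed at CDR-L3 sequences without changes.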

What's notable is the right tail: 23% of approved therapeutics have CDR-H3 lengths of 17 or longer. That is beyond what most phage display libraries are designed to sample efficiently. Standard NNK-based randomization of CDR-H3 at fixed length works well for 10–14 residue loops, but longer loops require specific sampling strategies, particularly ones that respect the kinked (bulged) and extended base conformations seen in CDR-H3 structures with >14 residues. Libraries that truncate CDR-H3 at length 14 by oligonucleotide design systematically exclude a substantial fraction of the approved therapeutic sequence space.
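One way to see why long loops strain fixed-length randomization is to look at how the NNK codon space grows with loop length. The sketch below assumes every loop position is fully randomized with NNK (32 codons per position, with TAG as the only encodable stop) and uses an illustrative library size of 1e10 transformants. Both are simplifying assumptions, since real designs usually fix some positions.

```python
def nnk_stats(loop_length):
    """Codon-level diversity and stop-free fraction for full NNK randomization."""
    codons_per_position = 32  # N (4 bases) x N (4 bases) x K (G or T)
    stop_codons = 1           # TAG is the only stop codon NNK can encode
    dna_diversity = codons_per_position ** loop_length
    stop_free = ((codons_per_position - stop_codons) / codons_per_position) ** loop_length
    return dna_diversity, stop_free

LIBRARY_SIZE = 1e10  # assumed transformant count, illustrative only

for n in (10, 14, 17, 20):
    diversity, stop_free = nnk_stats(n)
    coverage = min(1.0, LIBRARY_SIZE / diversity)
    print(f"len={n:2d} codon_space={diversity:.2e} "
          f"stop_free={stop_free:.2f} max_coverage={coverage:.2e}")
```

Under these assumptions, no realistic library covers the full codon space at any of these lengths, and the fraction that can be sampled drops by a factor of 32 with each added randomized position. That is part of why long loops tend to need structure-informed rather than uniform sampling.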

The short-loop end is also underrepresented in many libraries. CDR-H3 lengths of 8–9 residues (IMGT) account for about 11% of approved therapeutics, and these short loops often engage epitopes that longer loops physically cannot access — shallow, flat surfaces where a shorter, more constrained loop provides better complementarity than a longer, more flexible one. A library balanced around lengths 11–14 misses both tails.

CDR-L3: Much Tighter Distribution

CDR-L3 behaves very differently from CDR-H3 in the approved therapeutic set. Under IMGT numbering, CDR-L3 in kappa chains — which account for 75% of approved mAb light chains — is almost entirely 9 residues long. In our dataset, 112 of 150 antibodies (74.7%) have kappa CDR-L3 of exactly 9 residues. Lengths of 8 and 10 account for most of the remainder, with essentially nothing at length 7 or below, and very few at 11 or longer.

Lambda light chain CDR-L3 shows slightly more length variation (6–12 residues), but lambda chains are less common in approved therapeutics and have their own structural constraints.

This extreme CDR-L3 length conservation is not accidental — it reflects structural constraints on the canonical loop conformations available to light chain CDR-L3. The kappa CDR-L3 at 9 residues adopts what is structurally classified as the L3/L5 conformation family (following the AbM canonical structure nomenclature), which is by far the most common and most productive for binding. Deviating from length 9 requires specific structural support from framework positions that is not universally available.

The practical implication: for CDR-L3 in kappa-based libraries, fixing length at 9 and diversifying sequence is almost always the right approach. Introducing length variation in CDR-L3 at the library design stage adds complexity without meaningfully expanding the productive sequence space, because non-length-9 loops fold poorly without specific framework accommodations that single-framework library scaffolds don't provide.

Therapeutic Selection Bias vs. Biological Repertoire

It's important to interpret Thera-SAbDab length distributions with awareness of what they represent: the output of a very specific selection process. The approved therapeutic antibody set has been through ADME screening, developability assessment, immunogenicity evaluation, and clinical safety filtering in addition to efficacy selection. These filters are not neutral with respect to loop length.

Long CDR-H3 loops (>18 residues) are underrepresented in approved therapeutics relative to their prevalence in the naive repertoire. This isn't because they can't bind antigens — they can, and some bind exceptionally well. It's that very long CDR-H3 loops are associated with higher aggregation propensity, increased proteolytic susceptibility, and more difficult manufacturability. The therapeutic set shows survivorship bias: long loops that made it through development tended to be those that happened to have buried hydrophobic character or intramolecular contacts that confer stability.

For computational library design, this creates a tension: designing to match the approved therapeutic distribution means adopting survivorship-biased length priors, which may exclude loops that could bind your target of interest but carry developability flags. The pragmatic resolution is to use the therapeutic distribution as your central design space and treat outlier lengths as a secondary exploration space with explicitly acknowledged developability risk.
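The central-plus-exploration idea can be sketched as a simple mixture prior over loop lengths. Everything specific here is an assumption made for illustration: the central probabilities are invented, the 10% exploration weight is arbitrary, and `build_length_prior` and `sample_lengths` are hypothetical helper names.

```python
import random

def build_length_prior(central, explore_lengths, explore_weight=0.1):
    """Mix a central length distribution with a uniform exploration component.

    central: dict mapping length -> probability (expected to sum to 1)
    explore_lengths: outlier lengths to sample occasionally
    explore_weight: total probability mass given to exploration (assumed value)
    """
    explore_lengths = list(explore_lengths)
    prior = {length: p * (1.0 - explore_weight) for length, p in central.items()}
    for length in explore_lengths:
        prior[length] = prior.get(length, 0.0) + explore_weight / len(explore_lengths)
    return prior

def sample_lengths(prior, n, seed=0):
    """Draw n loop lengths from the prior with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    lengths = sorted(prior)
    weights = [prior[length] for length in lengths]
    return rng.choices(lengths, weights=weights, k=n)

# Invented central distribution, not the measured Thera-SAbDab histogram.
central = {11: 0.15, 12: 0.25, 13: 0.30, 14: 0.20, 15: 0.10}
prior = build_length_prior(central, explore_lengths=range(19, 23), explore_weight=0.1)

# Tag exploratory lengths so downstream steps can attach a developability flag.
exploratory = {length for length in prior if length >= 19}
print(sample_lengths(prior, 10), sorted(exploratory))
```

Keeping the exploration mass as an explicit parameter makes the developability trade-off visible and easy to tune per program.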

Implications for Loop Sampling in Structure Prediction

Loop length distribution priors also matter for computational structure prediction. ABodyBuilder2 and similar antibody-specific structure prediction tools have been trained predominantly on the available crystallographic data, which is concentrated in the same 10–14 residue CDR-H3 range that dominates the approved therapeutic set. For CDR-H3 loops of 17–20 residues, these models are working at the edge of their training distribution, and accuracy suffers accordingly.

In our internal benchmarking on the SAbDab holdout set, CDR-H3 length is the single most predictive variable for structure prediction RMSD — more so than the specific method used. Loops of 8–12 residues show median RMSD of 1.2 Å across the tested methods; loops of 16–18 residues show median RMSD of 3.1 Å; loops of 19+ residues deteriorate further to 5.4 Å median RMSD (though sample sizes get small at these lengths).

This length-accuracy relationship directly affects how we use structure predictions downstream. For an antibody with CDR-H3 of 14 residues, a structure prediction confidence score of 0.85 is fairly trustworthy. For a CDR-H3 of 18 residues, the same confidence score is less reliable — we apply additional sampling (extended loop modeling with RosettaLoopModel) and filter more conservatively before using the structure in docking or ΔΔG calculations.
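A length-aware triage rule along these lines could look like the sketch below. The RMSD bins are the benchmark medians quoted above. The length cutoff of 16 and the reuse of 0.85 as a minimum confidence are assumptions chosen for illustration, not our production thresholds, and the function names are hypothetical.

```python
# Median CDR-H3 RMSD (angstroms) per length bin, as quoted in the benchmark above.
# Lengths 13-15 were not reported, so they have no entry.
MEDIAN_RMSD_BY_BIN = [
    ((8, 12), 1.2),
    ((16, 18), 3.1),
    ((19, None), 5.4),
]

def expected_rmsd(h3_length):
    """Return the quoted median RMSD for a length, or None if no bin was reported."""
    for (low, high), rmsd in MEDIAN_RMSD_BY_BIN:
        if h3_length >= low and (high is None or h3_length <= high):
            return rmsd
    return None

def needs_extra_sampling(h3_length, confidence, long_loop_cutoff=16, min_confidence=0.85):
    """Flag a predicted structure for additional loop sampling before docking.

    long_loop_cutoff and min_confidence are assumed values for this sketch.
    """
    return h3_length >= long_loop_cutoff or confidence < min_confidence

for length, conf in [(14, 0.85), (18, 0.85), (12, 0.60)]:
    print(length, conf, expected_rmsd(length), needs_extra_sampling(length, conf))
```

The point of the rule is that the same confidence score triggers different handling depending on loop length, which is how the length-accuracy relationship enters the pipeline.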

Loop Length and Accessible Epitope Shape Space

There is a well-characterized correlation between CDR-H3 length and the classes of epitopes that an antibody can bind. Longer CDR-H3 loops (14–18 residues, IMGT) are better at engaging concave or groove-type epitopes (enzyme active sites, receptor binding clefts, and other recessed sites), where the extended loop can insert into the pocket. Shorter loops (8–11 residues) tend to dominate on shallow, flat protein surfaces, where a compact, more constrained loop gives better shape complementarity than a longer, more flexible one.

This isn't a strict rule: the antigen conformation at the binding site, the contributions from other CDR loops (particularly CDR-H2 and CDR-L3), and specific sequence-structural factors all modulate which epitopes a given loop length can access. But at the library design stage, it's a useful prior: if you're building a campaign targeting a known buried active site (e.g., kinase ATP binding site, protease active site), enriching your library at longer CDR-H3 lengths is likely more productive than a uniform distribution.

Conversely, if you're targeting a poorly characterized antigen where the epitope is unknown, a length distribution that mirrors the approved therapeutic distribution — centered at 12–14 residues with meaningful representation at 9–11 and 15–18 — gives you the broadest coverage of the shapes that have actually worked for drug development.

What This Means for CDR Design Tools

For computational CDR design — as opposed to library construction — loop length selection happens before you start designing sequence, and it should be deliberate. When we run a CDR-H3 optimization campaign at Genolux, we first query the structural database for antibodies with CDR-H3 at the target length, extract the geometric constraints on those loops (kink angle, backbone torsion distributions), and use those as priors for our sampling strategy. A CDR-H3 at length 14 has a well-defined conformational distribution; designing into that distribution is more efficient than treating the loop as unconstrained.
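Backbone torsions and kink descriptors both reduce to dihedral angles over four points, so the geometric extraction step can be sketched with a single function. This is a generic dihedral calculation, not our internal code. Which four atoms define the kink (for example, four consecutive Cα atoms at the loop base) is an assumption left to the caller, and the coordinates below are invented.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))

# Invented coordinates: a cis arrangement (0 degrees) and a trans one (180 degrees).
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
print(cis, trans)
```

Applied across all database loops of one length, the resulting angle distributions are the kind of prior the sampling step can draw from.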

The Thera-SAbDab length analysis is one input into this — knowing that the approved therapeutic set heavily populates CDR-H3 at 12–14 residues means that working in that length range gives you the highest-quality structural training data and the lowest structural prediction uncertainty. When a program requires a longer loop, we go in knowing the uncertainty is higher and budget accordingly for additional experimental structural characterization.