Developability Scoring: Which Physicochemical Properties Actually Predict Failure

The developability space for therapeutic antibodies is cluttered with proposed metrics. Published studies have proposed viscosity indices, isoelectric points, net charge at physiological pH, colloidal stability scores, spatial aggregation propensity (SAP), charge symmetry parameters, hydrophobic patch areas, predicted melting temperature, solubility scores, polyreactivity flags, and at least two dozen more. If you tried to hard-filter every antibody candidate against all proposed thresholds, you'd eliminate a third of approved therapeutics as developability failures.

The proliferation of metrics reflects a real problem: CMC (chemistry, manufacturing, and controls) failures are expensive. A candidate that aggregates in formulation, precipitates at therapeutic concentration, or adsorbs to device surfaces can fail late in development after years of investment. The instinct to flag more properties earlier is sound. The problem is that many proposed metrics are correlated with each other, derived from small training sets, or validated only on narrow chemical classes of antibodies. The result is a field where every team has a slightly different developability panel, and there's limited consensus on which metrics carry the most independent predictive weight.

We spent a significant portion of 2024 working through this, applying our scoring pipeline to a curated set of 400 therapeutic antibodies from the public Thera-SAbDab dataset for which CMC-related outcome annotations were available — primarily aggregation tendency, viscosity at ≥ 100 mg/mL, and expression yield. The objective was to ask: which sequence-computable metrics, applied before synthesis, actually distinguish the candidates that failed CMC from those that didn't?

The Metric Reduction Problem

We started with 28 metrics that can be computed from sequence alone or from a fast homology model. The first step was correlation clustering: many pairs of metrics have a pairwise correlation of 0.75 or higher across the dataset (for example, net charge at pH 7.4 and isoelectric point, or SAP score and hydrophobic patch area). Highly correlated metrics contribute little independent information; using both inflates the apparent weight of the underlying feature and can destabilize composite scoring.
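The pair-flagging step can be sketched in a few lines. This is a minimal illustration rather than the pipeline code: the function names, the toy values, and the use of absolute Pearson correlation are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def correlated_pairs(metrics, threshold=0.75):
    """Return (name_a, name_b, r) for every metric pair with |r| >= threshold.

    `metrics` maps a metric name to its values across the same candidates.
    """
    pairs = []
    for a, b in combinations(sorted(metrics), 2):
        r = pearson(metrics[a], metrics[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Toy values for four hypothetical candidates (not real data).
toy = {
    "net_charge_pH7_4": [2.0, 5.0, -1.0, 8.0],
    "pI": [7.9, 8.6, 7.1, 9.2],
    "glyco_motifs": [0, 1, 0, 0],
}
flagged = correlated_pairs(toy)
```

On the toy values only the net charge and pI pair is flagged; the glycosylation count is close to independent of both.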

After removing correlated pairs (keeping the member of each pair with higher univariate predictive power) and metrics that failed to reach statistical significance on the 400-antibody set (p > 0.05 in logistic regression against the CMC outcome label), we were left with seven metrics that each contributed independent predictive signal:

CDR SAP score (spatial aggregation propensity, paratope region only) — strongest individual predictor of aggregation propensity
VH/VL interface hydrophobic patch area — predictive of expression yield and aggregation, partially independent from SAP
Net charge at pH 6.0 (formulation pH for many mAb products) — correlated with viscosity at high concentration
CDR H3 Trp/Phe density — aromatic residues in H3 are disproportionately represented in polyreactive antibodies
Framework region charge asymmetry (VH vs. VL net charge difference) — associated with self-association behavior
Predicted thermal stability index (derived from contact order and hydrophobic burial in framework) — predictive of early-stage colloidal stability
N-glycosylation motif count in CDRs (NxS/T sequons, where x ≠ P) — direct chemical liability
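Two of the metrics in this list need nothing more than the CDR sequence string. The sketch below counts NxS/T sequons (x ≠ P) and computes Trp/Phe density; the function names and the example sequence are hypothetical, and the lookahead is used so that overlapping sequons are each counted.

```python
import re

# N, then any residue except P, then S or T.
# The zero-width lookahead lets overlapping motifs (e.g. "NNTS") both count.
SEQUON = re.compile(r"(?=N[^P][ST])")


def count_sequons(seq):
    """Number of N-x-S/T motifs (x != P) in an amino-acid string."""
    return len(SEQUON.findall(seq.upper()))


def aromatic_density(seq, residues="WF"):
    """Fraction of positions occupied by the given residues (Trp/Phe by default)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(seq.count(r) for r in residues) / len(seq)


# Hypothetical CDR H3 sequence, for illustration only.
h3 = "ARDYWGNGSFDY"
n_sequons = count_sequons(h3)
wf_density = aromatic_density(h3)
```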

These seven metrics together, in a simple logistic regression model, achieved an AUC of 0.73 on a held-out test set (80 antibodies withheld from model training). That's not a definitive pass/fail oracle, but it's meaningfully better than random (AUC 0.5) and practically useful for prioritizing a synthesis queue of 50–100 candidates where you can synthesize maybe 20–30.
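For readers less familiar with AUC: it is the probability that a randomly chosen failure receives a higher risk score than a randomly chosen non-failure. A minimal pairwise implementation is below; the labels and scores are toy values, and the function name is our own.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison: P(score of a positive > score of a negative).

    labels: 1 for a CMC failure, 0 otherwise. Ties count as half a win.
    Quadratic in the number of examples, which is fine for a test set of 80.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two failures, three non-failures (not real data).
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
```

Here five of the six failure/non-failure pairs are ordered correctly, so the toy AUC is 5/6.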

What Didn't Predict Well

The isoelectric point (pI) gets cited frequently as a developability metric; high-pI antibodies are often associated with polyreactivity and poor PK in rodent models. In our dataset, pI alone had an AUC of only 0.58, barely above random for distinguishing CMC failures. The problem is that pI is a coarse scalar derived from all charged residues in the full antibody, while the relevant behavior (surface charge distribution, local electrostatics near the paratope) requires a more granular calculation. Net charge at formulation pH, especially in the CDRs and framework surface, outperformed whole-protein pI in every regression we ran.
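To make the distinction concrete, here is a sketch of both quantities computed from sequence composition with the Henderson-Hasselbalch relation. The pKa values are illustrative assumptions (published tables differ), Cys and Tyr handling is simplified, and `include_termini` is a hypothetical switch for scoring an internal segment such as a CDR.

```python
# Illustrative pKa values (an assumption; published tables differ slightly).
# Cys is omitted on the assumption that antibody cysteines are disulfide-bonded.
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEG = {"D": 3.9, "E": 4.1, "Y": 10.1}
PKA_NTERM, PKA_CTERM = 8.6, 3.6


def net_charge(seq, ph, include_termini=True):
    """Approximate net charge of a sequence at a given pH."""
    seq = seq.upper()
    charge = 0.0
    if include_termini:
        charge += 1.0 / (1.0 + 10 ** (ph - PKA_NTERM))   # protonated N-terminus
        charge -= 1.0 / (1.0 + 10 ** (PKA_CTERM - ph))   # deprotonated C-terminus
    for aa, pka in PKA_POS.items():
        charge += seq.count(aa) / (1.0 + 10 ** (ph - pka))
    for aa, pka in PKA_NEG.items():
        charge -= seq.count(aa) / (1.0 + 10 ** (pka - ph))
    return charge


def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """pH at which net_charge crosses zero, found by bisection.

    Works because net charge decreases monotonically as pH rises.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Because His has a pKa near 6.5 in this table, a His-rich segment carries noticeably more positive charge at pH 6.0 than at pH 7.4. That shift is invisible in a single whole-protein pI value, and it is one reason to evaluate charge at the formulation pH and per region.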

GRAVY score (grand average of hydropathicity), a simple sequence-level hydrophobicity metric, also underperformed. It's useful for detecting grossly hydrophobic sequences, but therapeutic antibodies occupy a narrow enough hydrophobicity range that GRAVY score adds little discriminatory power within that window. The spatially resolved SAP score — which accounts for which hydrophobic residues are solvent-exposed and clustered — is substantially more informative.
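GRAVY is simply the mean Kyte-Doolittle hydropathy over the sequence, which is why it cannot see clustering. The sketch below uses the standard Kyte-Doolittle scale; the function name and the two example strings are ours.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Same composition, different arrangement: GRAVY cannot tell these apart.
clustered = "WWFFSSGG"
dispersed = "WSFGWSFG"
same_gravy = abs(gravy(clustered) - gravy(dispersed)) < 1e-12
```

The two toy strings have identical GRAVY scores even though one groups its aromatic residues together and the other spreads them out, which is the kind of difference a spatially resolved score is designed to detect.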

We're not saying pI and GRAVY are useless for all purposes. A candidate with pI > 9.5 or a strongly positive GRAVY score is worth noting. We're saying that using them as primary hard-filter criteria, as some groups do, is likely to reject candidates for the wrong reasons while failing to flag candidates with subtler aggregation liabilities.

The SAP Score: Nuances in Application

SAP is probably the most widely used single-metric developability filter in computational antibody design. The original Chennamsetty et al. formulation computes, for each solvent-exposed residue, a weighted sum of the hydrophobic contribution of neighboring residues within a 10 Å sphere, normalized by surface accessibility. The score identifies "sticky patches" where hydrophobic surface area is concentrated.

A detail that matters: whether you compute SAP over the full antibody surface or restrict it to the CDR region changes the interpretation significantly. Full-antibody SAP is dominated by framework surface contributions that are largely conserved across antibody sequences and don't vary much between candidates. CDR-local SAP, focusing on the 20–30 Å radius around the center of the paratope, picks up the specific hydrophobic liabilities introduced by CDR sequence optimization — particularly Trp and Phe substitutions used to improve affinity. The two calculations can give opposite rankings for the same candidate set.

We use CDR-local SAP as the primary metric and flag full-antibody SAP separately only when framework-region mutations are present. This distinction is one that's easy to miss when applying off-the-shelf developability software without reading the underlying calculation assumptions.
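The difference between the two scopes is easy to see in a simplified, residue-level sketch. This is not the published formulation, which as we understand it works at the atom level and averages over simulation frames; the inputs (one coordinate per residue, relative side-chain exposure, a centered hydrophobicity value), the `subset` argument, and the positive-sum aggregation are assumptions made for illustration.

```python
from math import dist


def sap_scores(coords, rel_sasa, hydrophobicity, radius=10.0, subset=None):
    """Simplified residue-level SAP.

    coords:         one (x, y, z) point per residue, in Å (e.g. side-chain centroid)
    rel_sasa:       side-chain exposure relative to a fully exposed residue (0..1)
    hydrophobicity: per-residue value, centered so polar residues are negative
    subset:         optional residue indices to score (e.g. CDR positions);
                    neighbors are still drawn from the whole structure
    """
    n = len(coords)
    targets = range(n) if subset is None else sorted(subset)
    scores = {}
    for i in targets:
        total = 0.0
        for j in range(n):
            if dist(coords[i], coords[j]) <= radius:
                total += rel_sasa[j] * hydrophobicity[j]
        scores[i] = total
    return scores


def patch_score(scores):
    """Aggregate by summing positive per-residue values (an assumed convention)."""
    return sum(s for s in scores.values() if s > 0)


# Toy three-residue "structure" (not real coordinates).
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
toy_sasa = [1.0, 0.5, 1.0]
toy_hyd = [0.8, 0.6, -0.5]
full = patch_score(sap_scores(toy_coords, toy_sasa, toy_hyd))
local = patch_score(sap_scores(toy_coords, toy_sasa, toy_hyd, subset={0}))
```

Passing the CDR indices as `subset` gives the CDR-local number; omitting it gives the full-surface number, and the two need not rank candidates the same way.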

Applying Thresholds vs. Composite Scores

Hard thresholds on individual metrics are operationally convenient but scientifically crude. A candidate with CDR SAP = 0.52 is not categorically different from one with CDR SAP = 0.48; the threshold introduces false precision. In practice, we use a composite score, a weighted sum of the seven retained metrics with weights derived from the logistic regression coefficients on our 400-antibody training set, and report candidates in percentile bands rather than binary pass/fail. The bottom quartile (score ≤ 25th percentile) gets a yellow flag requiring review; a score below the 10th percentile carries a hard recommendation against synthesis without sequence engineering.
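A minimal sketch of the scoring and banding logic follows. The weights and metric names are placeholders rather than our fitted coefficients, the sketch assumes metrics have already been standardized and signed so that a higher composite means a more developable candidate, and the percentile definition (share of candidates scoring at or below a given one) is one of several reasonable choices.

```python
def composite_score(metrics, weights):
    """Weighted sum over the metrics named in `weights`."""
    return sum(weights[name] * metrics[name] for name in weights)


def percentile_ranks(scores):
    """Map candidate -> percent of candidates scoring at or below it."""
    n = len(scores)
    values = list(scores.values())
    return {
        cand: 100.0 * sum(1 for v in values if v <= s) / n
        for cand, s in scores.items()
    }


def flag(percentile):
    """Band a candidate by its composite-score percentile."""
    if percentile < 10:
        return "recommend against synthesis"
    if percentile <= 25:
        return "review"
    return "no flag"


# Placeholder weights for two metrics (hypothetical, not fitted values).
example_weights = {"cdr_sap": -1.0, "thermal_index": 0.5}
example = composite_score({"cdr_sap": 0.5, "thermal_index": 1.0}, example_weights)

# Twenty hypothetical candidates with composite scores 1..20.
toy_scores = {f"cand_{i:02d}": float(i) for i in range(1, 21)}
toy_flags = {c: flag(p) for c, p in percentile_ranks(toy_scores).items()}
```

With twenty candidates, only the lowest-scoring one falls below the 10th percentile, the next four land in the review band, and the rest are unflagged.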

Composite scoring also allows explicit trade-off analysis: a candidate with a marginal CDR SAP score but excellent thermal stability and low charge asymmetry may be worth advancing, particularly if the affinity data is strong. A candidate that fails multiple metrics simultaneously — high SAP, low thermal stability index, and a CDR H3 NxT motif — gets deprioritized regardless of affinity.

Where Sequence-Level Scoring Reaches Its Limits

Sequence-computable metrics have a ceiling. They can't capture conformational dynamics, concentration-dependent behavior, or the specific interactions that emerge in the context of a particular formulation buffer and excipient. A candidate with acceptable sequence-level scores can still fail biophysical characterization: aggregation above 1% by dynamic light scattering (DLS) at 10 mg/mL, unacceptable viscosity at 100 mg/mL, or prolonged retention on hydrophobic interaction chromatography (HIC), a proxy for surface hydrophobicity.

The value of sequence-level scoring is not that it eliminates late-stage surprises — it won't. The value is that it concentrates synthesis effort on the subset of candidates where late-stage biophysical characterization is likely to succeed. In a campaign with 80 ranked candidates and resources to synthesize 25, directing synthesis toward the top-scoring 25 by composite developability score should increase the fraction that pass DLS and viscosity screening from the baseline expected rate. That's the marginal gain being claimed, and it's a realistic one.

For programs where we've been involved early enough to run sequence scoring before synthesis, our internal tracking shows roughly a 1.4–1.6x improvement in biophysical pass rate among synthesized candidates versus the baseline rate we'd expect from random selection within the same discovery campaign. That's useful but not dramatic — consistent with a tool that filters obvious failures while leaving the tail risk of unexpected biophysical behavior intact. Engineering programs should still budget for a biophysical characterization round regardless of how good the sequence scores look.
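To put the fold improvement in terms of a synthesis budget, the arithmetic is below. The 30% baseline pass rate is purely hypothetical, chosen to make the numbers easy to follow; substitute the rate observed in your own campaigns.

```python
def expected_passes(n_synthesized, baseline_rate, fold_improvement):
    """Expected number of candidates passing biophysical screening.

    The improved rate is capped at 1.0 so the result never exceeds n_synthesized.
    """
    rate = min(1.0, baseline_rate * fold_improvement)
    return n_synthesized * rate


# Hypothetical baseline: 30% of randomly selected candidates pass.
BASELINE = 0.30
random_pick = expected_passes(25, BASELINE, 1.0)
low_end = expected_passes(25, BASELINE, 1.4)
high_end = expected_passes(25, BASELINE, 1.6)
```

Under that assumed baseline, a 25-candidate synthesis round would go from about 7.5 expected passes to between 10.5 and 12, which still leaves more than half of the synthesized candidates failing.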