Benchmarking Rosetta Scoring Against Experimental ΔΔG Data for PPI Interface Mutations

The Role of Rosetta in PPI Interface Scoring

Rosetta has been a central tool in computational structural biology for more than two decades. Its protein-protein interface scoring capabilities — implemented through the InterfaceAnalyzer mover, the interface_delta_X scoring terms, and the broader ref2015 energy function — are used across the PPI modeling field for interface energetics calculation, computational alanine scanning, protein design, and relative stability prediction. Rosetta's scoring functions represent a significant scientific investment: they are physics-based in their treatment of van der Waals interactions and electrostatics, supplemented by statistical potentials for hydrogen bond geometry and side chain packing, with weights calibrated against experimental protein stability and binding affinity data.

For PPI drug discovery, the most critical application of Rosetta interface scoring is the prediction of per-residue ΔΔG values — the change in binding free energy upon mutation of each interface residue. This is what computational alanine scanning does: substituting each interface residue to alanine in silico and calculating the ΔΔG from the Rosetta interface score. Hot-spot residues are identified as those where ΔΔG exceeds a defined threshold, and the quantitative values inform pharmacophore design and compound prioritization.

The accuracy of Rosetta ΔΔG predictions for PPI interfaces has been studied extensively, and the results are informative but mixed. Understanding exactly where Rosetta is accurate, where it is systematically biased, and where it fails outright is essential for interpreting its outputs correctly in a drug discovery context. This benchmarking analysis characterizes those strengths and failure modes against curated experimental datasets for oncology PPI complexes.

Benchmarking Datasets and Methodology

Our benchmarking uses curated subsets of the SKEMPI v2.0 dataset, restricted to oncology-relevant PPI complexes where single-point alanine or non-alanine mutations have measured experimental ΔΔG values from isothermal titration calorimetry (ITC), surface plasmon resonance (SPR), or fluorescence anisotropy. We supplement SKEMPI with additional curated experimental data for specific target systems — MDM2-p53 variants, BCL-2/BH3 mutations, and KRAS effector interface mutations — where more complete published datasets are available in the primary literature.

The curated oncology PPI ΔΔG dataset contains approximately 280 single-point mutations across 9 target complexes after removing entries with high experimental uncertainty (error bars exceeding ±0.5 kcal/mol in the original measurement), mutations outside the interface as defined by buried surface area analysis, and disulfide-bonded or glycosylated positions where Rosetta's treatment is known to be unreliable. This is not a large dataset by the standards of enzyme-ligand benchmarking, but it is the relevant scope for calibrating PPI-specific scoring.
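The three exclusion criteria above can be expressed as a simple record filter. The following is a minimal sketch, not the curation pipeline itself: the `MutationRecord` fields, the function names, and the example records are hypothetical placeholders, and the interface, disulfide, and glycosylation flags are assumed to have been computed upstream.

```python
from dataclasses import dataclass

MAX_EXP_ERROR = 0.5  # kcal/mol; entries with larger reported error are dropped


@dataclass
class MutationRecord:
    # All field names are illustrative, not taken from SKEMPI's column layout.
    complex_id: str
    mutation: str        # e.g. "L10A"
    ddg_exp: float       # experimental ddG, kcal/mol
    ddg_err: float       # reported experimental uncertainty, kcal/mol
    at_interface: bool   # assumed to come from buried surface area analysis
    disulfide: bool
    glycosylated: bool


def keep(record: MutationRecord) -> bool:
    """Return True if the record passes all three exclusion filters."""
    if record.ddg_err > MAX_EXP_ERROR:
        return False  # high experimental uncertainty
    if not record.at_interface:
        return False  # mutation lies outside the interface
    if record.disulfide or record.glycosylated:
        return False  # positions where Rosetta's treatment is unreliable
    return True


def curate(records):
    """Filter a list of records down to the benchmarking set."""
    return [r for r in records if keep(r)]


# Invented placeholder records, one per outcome.
records = [
    MutationRecord("complex_A", "L10A", 1.2, 0.2, True, False, False),   # kept
    MutationRecord("complex_A", "K20A", 0.8, 0.9, True, False, False),   # error too large
    MutationRecord("complex_B", "F30A", 2.1, 0.3, False, False, False),  # not at interface
    MutationRecord("complex_B", "C40A", 1.5, 0.1, True, True, False),    # disulfide-bonded
]
curated = curate(records)
print([r.mutation for r in curated])
```

An entry whose error is exactly 0.5 kcal/mol is kept in this sketch, since the criterion is stated as "exceeding" the threshold.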

Rosetta calculations used the cartesian_ddg protocol with the ref2015_cart scoring function, which has been shown in prior benchmarking to outperform the classic ddg_monomer protocol for ΔΔG prediction at protein-protein interfaces. Twenty independent runs per mutation with coordinate minimization were averaged to reduce sampling noise. All calculations were performed on the crystal structure input after energy minimization with FastRelax; we report both the single-best-input-structure result and the ensemble-averaged result where multiple crystal structures were available.

Overall Accuracy: Where Rosetta Performs Well

Across the full 280-mutation benchmarking set, Rosetta cartesian_ddg achieves a Pearson correlation coefficient of approximately r = 0.62 between predicted and experimental ΔΔG values (R² ≈ 0.38). The mean absolute error is approximately 0.85 kcal/mol, and the root mean square error is approximately 1.15 kcal/mol. These numbers are broadly consistent with published benchmarks of Rosetta ΔΔG prediction on larger, more general datasets, suggesting that the oncology PPI subset does not present anomalously difficult prediction challenges.
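For reference, the three summary statistics quoted above can be computed from paired predicted and experimental values as in the sketch below. The function names and the toy input are illustrative assumptions; the toy values are not benchmark data.

```python
import math


def pearson_r(pred, exp):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_e = sum(exp) / n
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(pred, exp))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_e = sum((e - mean_e) ** 2 for e in exp)
    return cov / math.sqrt(var_p * var_e)


def mean_absolute_error(pred, exp):
    """Average of |predicted - experimental|, in kcal/mol."""
    return sum(abs(p - e) for p, e in zip(pred, exp)) / len(pred)


def root_mean_square_error(pred, exp):
    """Square root of the mean squared difference, in kcal/mol."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, exp)) / len(pred))


# Toy values chosen only to exercise the functions.
toy_pred = [1.0, 2.0, 3.0]
toy_exp = [2.0, 4.0, 6.0]
print(pearson_r(toy_pred, toy_exp))
print(mean_absolute_error(toy_pred, toy_exp))
print(root_mean_square_error(toy_pred, toy_exp))

# R^2 follows from r for a simple linear fit: 0.62 ** 2 = 0.3844, i.e. about 0.38.
print(round(0.62 ** 2, 2))
```

RMSE exceeds MAE whenever the errors are unequal in size, so the gap between 0.85 and 1.15 kcal/mol indicates that a minority of mutations carry comparatively large errors.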

For hot-spot identification, where a mutation is classified as a hot spot if |ΔΔG| > 1.5 kcal/mol and as a non-hot-spot otherwise, Rosetta achieves an accuracy of approximately 72% with a positive predictive value of approximately 0.68. The false negative rate (the fraction of true hot spots that are missed) is approximately 0.22 in the benchmarking set. In practical terms, Rosetta recovers roughly four in five true hot-spot residues and misses approximately one in five, while about one in three of its hot-spot calls is a false positive. The misses carry a real cost in a drug discovery program: compounds go unprioritized because the residue they contact was incorrectly classified as non-critical.
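The classification metrics follow from a confusion matrix built with the 1.5 kcal/mol threshold. The sketch below shows one way to compute them; the function names and toy values are hypothetical, and applying the same threshold to both predicted and experimental values is an assumption.

```python
HOTSPOT_THRESHOLD = 1.5  # kcal/mol, applied to |ddG|


def is_hotspot(ddg, threshold=HOTSPOT_THRESHOLD):
    """A mutation counts as a hot spot when |ddG| exceeds the threshold."""
    return abs(ddg) > threshold


def classification_metrics(pred, exp):
    """Return (accuracy, PPV, FNR) for predicted vs. experimental hot-spot labels."""
    tp = fp = tn = fn = 0
    for p, e in zip(pred, exp):
        predicted, actual = is_hotspot(p), is_hotspot(e)
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # PPV: fraction of predicted hot spots that are real.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    # FNR: fraction of real hot spots that were missed (1 - sensitivity).
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return accuracy, ppv, fnr


# Toy values: one true positive, one true negative, one false positive, one false negative.
toy_pred = [2.0, 0.5, 1.8, 1.0]
toy_exp = [2.5, 0.3, 1.0, 2.0]
print(classification_metrics(toy_pred, toy_exp))
```

Because sensitivity equals 1 minus the false negative rate, a false negative rate of 0.22 corresponds to recovering about 78% of true hot spots.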

Performance is notably better within the MDM2-p53 subset than across the heterogeneous full set. For MDM2-p53 specifically, Pearson r reaches approximately 0.74 and hot-spot identification accuracy exceeds 80%. This reflects the quality and diversity of the structural data available for this target — multiple crystal structures allow ensemble averaging that substantially reduces the noise from single-structure calculation artifacts.

Systematic Biases and Failure Modes

Several systematic biases in Rosetta ΔΔG prediction are evident from the benchmarking data and have documented mechanistic explanations.

Overestimation of burial penalties for charged residues. Rosetta consistently overestimates the destabilizing effect of mutating charged interface residues (arginine, lysine, glutamate, aspartate) to alanine. The overestimation is typically 0.5–1.5 kcal/mol compared to experimental values. The mechanistic origin is the treatment of salt bridges and charged residue desolvation in the Rosetta energy function — the continuum electrostatics treatment underestimates the actual screening effect of water at the interface, leading to overestimated electrostatic contributions. For targets like STAT3 where the interface has a significant electrostatic component (phosphotyrosine mimicry, basic groove contacts), this bias meaningfully affects the ΔΔG ranking of interface residues and needs to be corrected by applying a calibrated offset to charged-residue predictions.

Underestimation of aromatic interaction energetics. Rosetta systematically underestimates the ΔΔG of tryptophan, tyrosine, and phenylalanine mutations to alanine: the predicted ΔΔG for removing an aromatic contact is smaller than the experimental value in 65% of benchmarked cases with |ΔΔG_exp| > 1.0 kcal/mol. The W23 position in MDM2-p53 is a specific example: the experimental ΔΔG for W23A is approximately 3.8 kcal/mol, whereas our Rosetta calculations predict approximately 2.9 kcal/mol. This underestimation of roughly 0.9 kcal/mol is consistent across multiple structural inputs and calculation protocols. It matters because tryptophan and phenylalanine are frequently hot-spot residues at hydrophobic PPI interfaces, so systematic underestimation of their contribution leads to incorrect relative ranking of hot-spot positions.

Sensitivity to input structure quality. The benchmarking shows substantially higher variance in predictions for targets where only a single input structure is available compared to targets with structural ensembles. For AlphaFold2-predicted complex structures, the prediction variance is higher still: the coordinate uncertainty in predicted structures propagates to ΔΔG prediction uncertainty that is not captured by the single-point Rosetta output. We quantify this as the standard deviation of ΔΔG values across the structural ensemble. For targets with structural ensembles, this ranges from 0.3 to 0.9 kcal/mol per residue; for single-structure inputs, it cannot be estimated from the input data but is expected to be larger.
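The ensemble spread described above can be computed per mutation as in the following sketch. The data layout (a mapping from mutation to per-structure ΔΔG values), the function names, and the structure labels are hypothetical, and the use of the sample standard deviation is an assumption. A single-structure input returns `None` for the spread, reflecting that it cannot be estimated.

```python
from statistics import mean, stdev


def ensemble_summary(ddg_by_structure):
    """Return (mean ddG, spread) across input structures for one mutation.

    ddg_by_structure maps a structure label to the predicted ddG (kcal/mol)
    obtained from that structure. The spread is the sample standard
    deviation, or None when only one structure is available.
    """
    values = list(ddg_by_structure.values())
    if not values:
        raise ValueError("need at least one structure")
    spread = stdev(values) if len(values) >= 2 else None
    return mean(values), spread


def summarize_scan(scan):
    """Apply ensemble_summary to every mutation in a scan."""
    return {mutation: ensemble_summary(per_structure)
            for mutation, per_structure in scan.items()}


# Placeholder values, not benchmark results.
scan = {
    "L10A": {"struct_1": 1.0, "struct_2": 2.0, "struct_3": 3.0},
    "F30A": {"struct_1": 1.5},  # single structure: spread not estimable
}
for mutation, (avg, spread) in summarize_scan(scan).items():
    print(mutation, avg, spread)
```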

Comparison to Alternative Approaches

Free energy perturbation (FEP) methods — specifically FEP+ as implemented in commercial packages and alchemical FEP as available in open-source tools — produce substantially better ΔΔG prediction accuracy than Rosetta for the cases where comparison is possible. In our benchmarking subset where FEP calculations are computationally feasible (smaller proteins, mutations without large structural rearrangements), FEP achieves Pearson r ≈ 0.80–0.85 versus r ≈ 0.62 for Rosetta. The accuracy improvement is genuine and significant.

The tradeoff is throughput. Rosetta ΔΔG calculation for a 20-residue interface alanine scan completes in minutes to hours on a standard compute node. FEP calculation for a single mutation requires nanosecond-scale alchemical transformation simulations, running to hours or days per mutation on equivalent hardware. For an interface with 30 residues to scan, FEP is 50–100× more computationally expensive per residue than Rosetta. For a virtual library of 100,000 compounds where per-compound ΔΔG disruption contributions need to be estimated, FEP is effectively intractable at current computational costs. Rosetta is the appropriate tool for interface characterization and hot-spot identification at screening scale; FEP is the appropriate validation tool for a small set of prioritized compounds where high-confidence ΔΔG prediction is worth the computational investment.

Calibration Corrections for Operational Use

Based on the benchmarking analysis, we apply two calibration corrections to Rosetta ΔΔG outputs before using them in hot-spot identification:

A charged-residue offset (approximately 0.7 kcal/mol, subtracted from the predicted ΔΔG for charged-to-alanine mutations and based on the mean benchmarked bias) that reduces overestimation of charged-residue hot-spot contributions.
An aromatic-residue amplification factor (approximately ×1.25 scaling of predicted ΔΔG for Trp/Tyr/Phe-to-alanine mutations) that corrects for systematic underestimation of aromatic hot-spot contributions.
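The two corrections can be sketched as a single function applied to each predicted value. This is an illustration under stated assumptions rather than the production calibration: the constants are the approximate values quoted above, the function and constant names are hypothetical, and leaving non-alanine mutations and other residue types unchanged is an assumption.

```python
CHARGED = {"R", "K", "E", "D"}   # Arg, Lys, Glu, Asp
AROMATIC = {"W", "Y", "F"}       # Trp, Tyr, Phe

CHARGED_OFFSET = 0.7    # kcal/mol, subtracted (approximate benchmarked bias)
AROMATIC_SCALE = 1.25   # multiplicative scaling (approximate)


def calibrate(wild_type, mutant, ddg_pred):
    """Apply the charged-residue offset or aromatic scaling to a predicted ddG.

    wild_type and mutant are one-letter amino acid codes. Only
    X-to-alanine mutations are corrected in this sketch.
    """
    if mutant != "A":
        return ddg_pred
    if wild_type in CHARGED:
        return ddg_pred - CHARGED_OFFSET
    if wild_type in AROMATIC:
        return ddg_pred * AROMATIC_SCALE
    return ddg_pred


# W23A from the text: predicted ~2.9 kcal/mol, experimental ~3.8 kcal/mol.
print(calibrate("W", "A", 2.9))
```

For the W23A example, scaling the predicted 2.9 kcal/mol by 1.25 gives about 3.6 kcal/mol, which closes most of the gap to the experimental value of approximately 3.8 kcal/mol.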

These corrections are target-class-specific and are recalibrated when new experimental ΔΔG data becomes available for a target system in our panel. They are applied consistently across all interface characterization reports and disclosed in the methodology section of each characterization output — a scoring correction that is applied invisibly is epistemically worse than one that is documented and verifiable.

The benchmarking is ongoing. As experimental ΔΔG datasets for additional oncology PPI complexes become available, particularly for less-characterized targets such as Wnt pathway interfaces and STAT3, the calibration corrections will be updated and the accuracy bounds revised. A scoring function that fails in predictable, well-understood ways is more useful for drug discovery decision-making than one that fails randomly.