Accurate quantification in proteomics with QuantUMS – Nature Biotechnology

Liquid chromatography coupled to mass spectrometry (LC–MS)-based bottom-up proteomics is a rapidly evolving field. Recent technological developments, including faster, more sensitive instruments, novel acquisition methods and advanced data processing, have pushed the limits of throughput and sensitivity. This progress has reduced costs and increased reproducibility, establishing proteomics as a powerful tool for basic biological discovery and translational research and as a potential foundation for personalized medicine¹.

Data-independent acquisition (DIA)^{2,3,4} proteomics has grown in popularity. Advances in both instrumentation and analysis software have addressed its earlier limitations, while strengthening its core advantages: high proteomic depth, data completeness and improved quantitative performance¹. Furthermore, a number of recent DIA technologies now offer the reliable fragment-to-precursor mass assignment that was previously the main strength of data-dependent acquisition (DDA), promising to expand the applicability of DIA even further^{5,6,7,8}. Improvements in sensitivity^{9,10,11,12,13,14,15} and the rise of multiplexed DIA^{16,17,18,19,20,21,22} suggest that DIA will expand into applications that have traditionally relied on targeted methods.

Much of the progress in DIA has been driven by data processing software improvements^{23,24,25,26,27,28,29,30}, with new algorithms enabling rapid gains in the number of proteins identified. Advanced machine learning now allows peptides to be confidently matched to recorded signals, even in the presence of noise and signal interferences from coeluting and cofragmenting peptide species. This capability, however, raises a critical question: how reliable are the quantitative values derived from these extra identifications and how much do they benefit the biological conclusions? This issue is especially important for modern, high-sensitivity, high-throughput workflows, such as single-cell proteomics or spatial tissue proteomics, that generate challenging data at scales of hundreds of samples per day^{5,9,11,21,31,32,33,34,35,36,37}.

Substantial effort has been directed toward developing computational methods that can improve quantitative reproducibility, precision and accuracy of proteomic experiments. These include deconvolution of spectra³⁸, selection of peptide fragment ions based on the signal quality^{21,26,39} and protein quantification through aggregation of multiple parallel sources of quantitative information, such as peptide-level MaxLFQ for DDA⁴⁰ and fragment-level MaxLFQ for DIA⁴¹ or directLFQ⁴². Advanced methods for error control and missing data handling have also been developed for statistical analysis of proteomics data, as discussed and benchmarked recently^{43,44}. Nevertheless, a fundamental problem remains: although peptide identification error rates are reliably controlled (for example, using target–decoy competition), it is currently challenging to reliably estimate quantification errors.

In each tandem MS acquisition, the MS instrument records multiple intensity signals for each detected peptide precursor: the unfragmented precursor intensity (MS1) and the signals from its fragment ions (MS/MS). The quantification algorithm previously implemented in our DIA-NN software (‘legacy’ DIA-NN mode) quantifies each precursor by summing the three highest-quality fragments selected across runs using correlation-based scores²⁶. This approach enables filtering out signals that are strongly affected by interferences in multiple runs but remains susceptible to interferences observed in individual acquisitions. Here, much of the available information, including the MS1 signal, is discarded, although recent work demonstrates the relevance of MS1 information for more accurate quantification in multiplexed DIA²¹ and for downstream statistical analysis⁴⁵. Integrating all available quantitative features (MS1 and MS/MS signals) should naturally yield higher accuracy and precision than using a limited subset. However, measured signals are subject to errors caused by random noise^{46,47} and ‘interferences’ (ref. ⁴), so an integration algorithm must account for these.
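As a rough illustration of the legacy strategy described above, the following sketch selects the three fragments with the best cross-run score and sums their intensities in each run. The function name, the data layout and all numbers are hypothetical, and the correlation-based scoring used by DIA-NN is not reproduced; the scores are simply assumed to be given.

```python
def top_n_sum_quantity(intensities, scores, n_top=3):
    """Sum the n_top best-scoring fragments in every run.

    intensities: {run: {fragment: intensity}}
    scores: {fragment: cross-run quality score}, higher is better (assumed given)
    """
    # Fragments are selected once, across runs, so every run uses the same set.
    selected = sorted(scores, key=scores.get, reverse=True)[:n_top]
    return {
        run: sum(frags.get(f, 0.0) for f in selected)
        for run, frags in intensities.items()
    }


# Hypothetical precursor with four fragments; the true run2/run1 ratio is 2.
scores = {"y4": 0.95, "y5": 0.90, "y6": 0.40, "b3": 0.85}
intensities = {
    "run1": {"y4": 100.0, "y5": 80.0, "y6": 300.0, "b3": 50.0},
    # b3 carries an interference in run2 only (clean value would be 100).
    "run2": {"y4": 200.0, "y5": 160.0, "y6": 310.0, "b3": 400.0},
}
quantities = top_n_sum_quantity(intensities, scores)
print(quantities)
```

In this toy case the consistently poor fragment (y6) is excluded, but the run-specific interference on b3 still inflates the run2 quantity, which is the weakness noted in the text.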

To integrate all of the available MS1 and MS/MS information in a statistically justified manner, we devised QuantUMS (quantification using an uncertainty-minimizing solution; Fig. 1a). QuantUMS implements an algorithm that can quantify precursors using any subset of quantitative features. By modeling the statistical properties of LC–MS-derived errors, it determines the relative contributions of features from their respective quality metrics and the hyperparameters of the algorithm. QuantUMS then optimizes its hyperparameters toward two goals. First, concordance of relative precursor quantities calculated using distinct quantitative features is maximized, improving precision. Here, QuantUMS builds on the idea that the ratios between quantitative features of the same precursor are expected to be consistent across acquisitions. Second, hyperparameters are also tuned to make the distributions of quantitative ratios obtained using high-quality and low-quality signals similar, tackling any ratio compression bias that affects noisy quantities and improving overall accuracy. Lastly, the optimized algorithm is used to quantify first precursors and then proteins, using all available quantitative features (Fig. 1a, Methods and Supplementary Information).
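The sketch below is not the QuantUMS algorithm. It only illustrates the underlying intuition: features of the same precursor should report the same ratio between two runs, and features judged less reliable can be down-weighted. It uses plain inverse-variance weighting of per-feature log₂ ratios, and the per-feature variances, names and numbers are all hypothetical stand-ins for what an error model would supply.

```python
import math


def weighted_log2_ratio(run_a, run_b, variances):
    """Inverse-variance weighted mean of per-feature log2(run_b / run_a)."""
    num = den = 0.0
    for feature, var in variances.items():
        a, b = run_a.get(feature), run_b.get(feature)
        if not a or not b:  # skip missing or zero signals
            continue
        weight = 1.0 / var
        num += weight * math.log2(b / a)
        den += weight
    if den == 0.0:
        raise ValueError("no usable features shared by both runs")
    return num / den


# Hypothetical MS1 and fragment signals; the true log2 ratio is 1.
run_a = {"ms1": 1000.0, "y4": 100.0, "y5": 80.0, "b3": 50.0}
run_b = {"ms1": 2000.0, "y4": 200.0, "y5": 160.0, "b3": 400.0}
# Assumed error variances: b3 is flagged as unreliable.
variances = {"ms1": 0.01, "y4": 0.01, "y5": 0.01, "b3": 1.0}

ratio = weighted_log2_ratio(run_a, run_b, variances)
print(round(ratio, 4))
```

Because the discordant fragment receives a small weight, the combined estimate stays close to the ratio reported by the concordant features, including MS1.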

**Fig. 1: QuantUMS performs statistically justified minimization of quantification uncertainty.**

QuantUMS thereby addresses two central challenges of LC–MS proteomics. First, measured signals may be subject to interferences, which bias their quantities upwards and thereby cause ratio compression—an effect that becomes more pronounced with decreasing precursor abundance. Without requiring any knowledge of the experiment design, QuantUMS can estimate and effectively eliminate such bias, at the cost of a decrease in precision. Balancing this trade-off, the QuantUMS module in DIA-NN implements two preconfigured modes termed high-precision and high-accuracy, which we benchmark in the present work. Second, controlling for quantification errors has so far been challenging in proteomics, mostly relying on precursor or protein coefficients of variation (CVs). However, CVs neither control for interference-caused systematic errors that might severely impact the accuracy of quantification while preserving precision nor reflect errors that only manifest in individual acquisitions. Therefore, ensuring confidence of observations pertaining to specific proteins necessitates laborious, potentially biased and sometimes technically impossible manual checks of extracted ion chromatograms for each acquisition and peptide of interest. With QuantUMS, we mitigate this problem by introducing a quantity-specific quality metric, which enables effective filtering for confident downstream analysis and statistical inference.
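A small numerical example, with made-up intensities, may help show why CVs alone cannot reveal interference-driven bias. Adding a constant interfering signal to every measurement compresses the observed ratio while the CV stays low (here it even decreases).

```python
import math
import statistics


def cv(values):
    """Coefficient of variation: sample standard deviation over mean."""
    return statistics.stdev(values) / statistics.mean(values)


def log2_ratio(low, high):
    return math.log2(statistics.mean(high) / statistics.mean(low))


# Hypothetical replicate intensities of one precursor in two conditions (true ratio 4).
true_low = [10.0, 10.5, 9.5]
true_high = [40.0, 42.0, 38.0]

# Assumed constant interference added to every measurement.
interference = 10.0
obs_low = [v + interference for v in true_low]
obs_high = [v + interference for v in true_high]

true_ratio = log2_ratio(true_low, true_high)
obs_ratio = log2_ratio(obs_low, obs_high)
print(true_ratio, obs_ratio)        # observed log2 ratio is compressed
print(cv(true_low), cv(obs_low))    # yet the CV does not get worse
```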

To evaluate the performance of QuantUMS, we carried out benchmarks on multiple DIA datasets covering synthetic as well as biological experiments, different instrument types and experimental scales.

First, we benchmarked QuantUMS on a mixed-species dataset recorded on a timsTOF Pro³⁷, where the quantitative ground truth is generated by mixing human (K562) and Escherichia coli tryptic digests in different predefined ratios. On such mixed-species datasets, one can evaluate the ability of the instrument and the data-processing software to correctly recover these ratios, examining the mean absolute log₂ deviation (MAD) of the constant species (human) as a proxy for quantitative precision and the MAD of the differential species (in this case, E. coli) as a proxy for quantitative accuracy. With the accuracy of sample preparation confirmed by examining high-quality MS1 signals (Supplementary Fig. 1c), we compared QuantUMS high-precision and high-accuracy modes to the legacy DIA-NN quantification mode (Fig. 1b, Supplementary Fig. 2a and Methods). QuantUMS high-precision mode yielded the best precision (human protein MAD = 0.10) while reducing the MAD of E. coli ratios 1.7-fold (0.33 to 0.19). The high-accuracy mode of QuantUMS further reduced the MAD of E. coli ratios 2.2-fold compared to legacy mode (0.33 to 0.15), eliminating ratio compression while maintaining comparable precision. The observed improvements are robust to tuning QuantUMS hyperparameters on just a subset of samples (A and/or C; Methods), where results (Supplementary Fig. 2b) are comparable to those obtained by training on the entire dataset A + B + C (rightmost panels in Fig. 1b and Supplementary Fig. 2a). This indicates that the optimized hyperparameters of QuantUMS reflect the inherent properties of the LC–MS setup and the sample matrix but do not reflect the experiment design. Furthermore, the QuantUMS results are robust to varying the threshold that QuantUMS applies internally to select high-quality signals used as a reference for bias minimization (Supplementary Information and Supplementary Fig. 2c). 
Filtering the dataset on the basis of acquisition-specific precursor-level and protein-level quantity and quality metrics introduced in QuantUMS (Fig. 1c and Supplementary Fig. 1b,d) shows that, as intended, only accurately quantified precursors and proteins are retained, with 0.75 quality quantile protein-level filtering resulting in almost perfect protein ratios (Fig. 1c, right). On mixed-species datasets acquired using other LC–MS platforms, involving TripleTOF 6600 (ref. ⁴⁸) (Supplementary Fig. 3) or Orbitrap Astral⁴⁹ (Supplementary Fig. 4) MS instruments, QuantUMS likewise shows improvements in quantitative performance.
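The benchmark metric and the effect of quality-based filtering can be sketched as follows. The expected ratio, the observed ratios and the quality values are invented for illustration, and the filter assumes that filtering at quantile q keeps the entries whose quality is at or above that quantile; the exact definition used in DIA-NN may differ.

```python
import math


def mean_abs_log2_deviation(observed_ratios, expected_ratio):
    """Mean absolute deviation of log2 observed ratios from the log2 expected ratio."""
    expected = math.log2(expected_ratio)
    deviations = [abs(math.log2(r) - expected) for r in observed_ratios]
    return sum(deviations) / len(deviations)


def quality_quantile_filter(ratios, qualities, q):
    """Keep ratios whose quality is at or above the q-th quality quantile (assumed rule)."""
    ranked = sorted(qualities)
    threshold = ranked[min(int(q * len(ranked)), len(ranked) - 1)]
    return [r for r, s in zip(ratios, qualities) if s >= threshold]


# Hypothetical E. coli protein ratios with an expected ratio of 4.
ecoli_ratios = [4.0, 2.0, 4.0, 2.0]    # two proteins show ratio compression
qualities = [0.9, 0.3, 0.8, 0.2]       # assumed per-protein quality metric

mad_all = mean_abs_log2_deviation(ecoli_ratios, expected_ratio=4.0)
kept = quality_quantile_filter(ecoli_ratios, qualities, q=0.5)
mad_kept = mean_abs_log2_deviation(kept, expected_ratio=4.0)
print(mad_all, mad_kept)
```

Applied to the constant species with an expected ratio of 1, the same function gives the precision proxy described in the text.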

To evaluate the robustness of QuantUMS with respect to variations in experiment size and inclusion of vastly different loading amounts, we used our recently recorded mixed-species dataset⁵⁰, measured with DIA parallel accumulation–serial fragmentation (dia-PASEF) at amounts spanning a tenfold range. Compared to the legacy mode, QuantUMS high-accuracy mode alleviated or even eliminated ratio compression and substantially improved overall accuracy across all considered loading amounts, despite minor ratio expansion becoming apparent for highly abundant proteins at low loads (Supplementary Fig. 5a). These observed accuracy gains remained consistent, regardless of whether QuantUMS hyperparameters were optimized on the entire experiment (144 runs; Supplementary Fig. 5a) or just subsets thereof (8–96 runs; Supplementary Fig. 5b).

Next, we investigated what effect the enhanced performance of QuantUMS has on differential expression analyses as a common downstream application. On two out of three considered mixed-species datasets, QuantUMS yielded higher numbers of differentially expressed proteins (Welch’s t-test) than legacy mode at the same empirical FDR (Supplementary Fig. 6, left). We then speculated that the ability of QuantUMS to alleviate ratio compression and, thus, improve the linearity of the quantitative response may prove important when considering experiments where large normalization factors need to be applied, such as any applications where sample loading amounts are not equalized and applications with inherently diverse samples, including single-cell or spatial proteomics. In a benchmark to simulate this situation, we did indeed observe both QuantUMS modes to strongly reduce the false-positive numbers compared to the legacy mode after applying strong normalization, at the same adjusted P values (Supplementary Fig. 6, middle), with greater numbers of proteins correctly detected as differentially expressed at a given empirical FDR threshold (Supplementary Fig. 6, right). Therefore, we hypothesize that the enhanced accuracy (linearity) of QuantUMS may reduce false positives in calling differentially expressed proteins, although whether this conclusion also holds for actual experiments of interest, as opposed to a highly artificial benchmark here, remains to be investigated with an appropriate experiment design.
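For readers less familiar with this kind of benchmark, the sketch below computes Welch's t statistic with its Welch–Satterthwaite degrees of freedom and an empirical FDR on a mixed-species design. It assumes that any protein of the constant species (human) called differentially expressed counts as a false positive; the data and function names are hypothetical, and converting t to a P value (which needs the t distribution, for example from SciPy) is left out to keep the example dependency-free.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-group variance of the mean; unequal variances are allowed.
    va = statistics.variance(sample_a) / n_a
    vb = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df


def empirical_fdr(called_species, constant_species="human"):
    """Fraction of called proteins that belong to the constant species."""
    if not called_species:
        return 0.0
    false_positives = sum(1 for s in called_species if s == constant_species)
    return false_positives / len(called_species)


# Hypothetical log2 quantities of one protein in two conditions.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])

# Hypothetical list of proteins called significant: 19 E. coli, 1 human.
fdr = empirical_fdr(["ecoli"] * 19 + ["human"])
print(t_stat, dof, fdr)
```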

To evaluate the performance of QuantUMS in the presence of biological and sample preparation variation, we carried out differential expression analysis on a human fibroblast perturbation dataset⁵¹. QuantUMS more than doubled the numbers of differentially expressed proteins at both 1% and 5% FDR compared to legacy quantification and more than tripled them compared to DIA-NN 1.8, which was used in the original publication (Fig. 2a). Filtering the protein lists on the basis of the protein quantity quality metric averaged across runs, to reduce the multiple testing burden, further improved the numbers of significant proteins at a given FDR (Fig. 2a, bottom). Here, filtering by the quality metric marginally outperformed the ‘naive’ approach of filtering on the basis of the average estimated protein quantity (top 1 method). Testing differential expression on a dataset of a cohort of persons with chronic lymphocytic leukemia (CLL)⁵² against the supplied characteristics showed that QuantUMS increased the numbers of proteins identified as differentially expressed compared to the legacy mode in all tests (Fig. 2b). By design, QuantUMS reduces variation originating from LC–MS measurement errors. The benefit of QuantUMS in a given experiment, therefore, depends on the relative contribution of LC–MS errors compared to the variation of biological or sample preparation origin. Consistent with this, the greater advantage of QuantUMS in the fibroblast dataset likely relates to the lower biological variability of cultured cells compared to the CLL dataset, where individual samples can be expected to exhibit greater heterogeneity, including proteoform-level variation, which QuantUMS is not designed to address.
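The effect of pre-filtering on the multiple testing burden can be illustrated with a minimal Benjamini–Hochberg sketch. The P values, quality scores and the 0.5 quality cutoff are invented; the example only assumes that the filter (here, an averaged quality score) is chosen without looking at the test results.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(pvalues, alpha=0.05):
    return sum(1 for p in benjamini_hochberg(pvalues) if p <= alpha)


# Hypothetical per-protein P values and averaged quality scores.
pvalues = [0.01, 0.02, 0.5, 0.8, 0.9, 0.95]
quality = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]

n_all = count_significant(pvalues)

# Keep only proteins with an (assumed) quality score of at least 0.5 before testing.
filtered = [p for p, s in zip(pvalues, quality) if s >= 0.5]
n_filtered = count_significant(filtered)
print(n_all, n_filtered)
```

With fewer, better-quantified proteins entering the correction, the same raw P values can pass the FDR threshold that they miss in the unfiltered list.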

**Fig. 2: QuantUMS boosts the sensitivity of differential expression analysis.**

With QuantUMS, we address a long-standing problem of untargeted proteomics, that is, the lack of quality control for peptide and protein quantities obtained in an experiment. We show that taking into account the quality information available for individual signals recorded by the MS instrument makes it possible not only to improve quantitative performance per se but also to produce effective quality metrics that ensure confidence in the data and further empower the subsequent statistical analyses. So far, we have benchmarked QuantUMS on DIA proteomics data, using DIA-NN’s quality scores, but the algorithm is open to incorporating novel quality scores and other acquisition approaches that likewise generate multiple channels of quantitative information, including selected and parallel reaction monitoring, as well as other experiments that involve recording multiplexed MS/MS spectra. We further envision great potential for future improvements in quantitative proteomics to be achieved by integrating QuantUMS with downstream statistical analysis approaches, such as MSstats⁴³ or Triqler⁴⁴, to enable biological inference that is fully aware of all kinds of uncertainty, missingness and normalization issues in raw proteomics data.
