Peptidoforms parsed from MaxQuant output are not always valid
Hi!
I am trying to parse a few different MaxQuant output files, and the modified sequences extracted from them do not end up as valid ProForma. Here is one row from one file:
Raw file Scan number Scan index Sequence Length Missed cleavages Modifications Modified sequence Oxidation (M) Probabilities Oxidation (M) Score Diffs Oxidation (M) Proteins Charge Fragmentation Mass analyzer Type Scan event number
01974c_BA1-TUM_missing_first_1_01_01-3xHCD-1h-R4 18859 16929 ACVINGMQLK 10 0 Oxidation (M) _ACVINGM(ox)QLK_ ACVINGM(1)QLK ACVINGM(101.9)QLK 1 TUM_missing_first_1 2 HCD FTMS MULTI-MSMS 31 0 575.29138 1148.5682 0.17725 1.7827292 24.19 0.00036581 101.9 100.71 101.9 1 1 0.812871 0.009392453 0.1333722 18828 6096732 0.426561 -2 0.06135368 y1;y2;y3;y4;y5;y6;y7;y8;y9;y1-NH3;y6-NH3;y7-NH3;y8-NH3;y8(2+);a2;b2;b3;b4;b5 71809.2;29973.1;24011.2;32854.1;190928.3;556904.4;805166.8;685423.2;105967.8;11586.8;26761.8;29306.5;32482.7;10282.6;114158.7;652840.1;435104.2;53264.8;28631 -0.0004923382;-0.0005195443;-0.003007707;0.0006837579;0.0004915272;-0.0003559902;-0.0005144462;0.0002259294;-0.00348233;0.0003633455;-0.00334429;-0.002593298;-0.006689147;0.003100681;4.57087E-05;-6.530397E-05;-0.0001010637;-0.0003278859;-0.002078173 -3.34666;-1.996731;-7.746661;1.277359;0.8298453;-0.5039816;-0.6278023;0.2459745;-3.228739;2.79312;-4.851494;-3.231865;-7.420119;6.744212;0.2239743;-0.2813915;-0.305196;-0.7381029;-3.722506 147.113296508789;260.197387695313;388.258453369141;535.290161132813;592.311817087127;706.355592051727;819.439814488134;918.507488028666;1078.54184448957;130.085891723633;689.33203125;802.415344238281;901.487854003906;459.75439453125;204.080078125;232.075103759766;331.143553435689;444.227844238281;558.272521972656 19 0.5166685 0.2289157 None Unknown 101.8962;1.189599;0.2601814 ACVINGMQLK;LKDSEGSGTAGK;DAHKSEVAHR _ACVINGM(ox)QLK_;_LKDSEGSGTAGK_;_DAHKSEVAHR_ 66 8 7 7 15 1
and now another file:
Raw file Scan number Scan index Sequence Length Missed cleavages Modifications Modified sequence Oxidation (M) Probabilities Phospho (STY) Probabilities Oxidation (M) Score diffs Phospho (STY) Score diffs Acetyl (Protein N-term) Oxidation (M)
OXPAL230121_44 14613 7897 AAAEGEMK 8 0 Oxidation (M) _AAAEGEM(Oxidation (M))K_ AAAEGEM(1)K AAAEGEM(79)K 0 1 0 P0A9B2 gapA Glyceraldehyde-3-phosphate dehydrogenase A 2 HCD FTMS MULTI-MSMS 1 0.0 411.68674 821.35892 0.31799 0.00013091 -0.57361634 25.967 0.0047701 79.116 68.3 79.116 1.0 1 0 0 0 14612 8211736.5 0.0651129635157789 -8 0.0703334808349609 y1;y2;y4;y5;y6;y7;y5-H2O;y6-H2O;y1-NH3;a2;b2;b3 162904.734375;75995.5625;159223.859375;106600.2265625;201834.5625;48688.8359375;47573.99609375;7619.5693359375;65347.02734375;367253.65625;314543.28125;40487.2734375 2.646063904876428E-05;-0.0001748173209534798;0.00041258822591316857;-0.0005543576077116086;-0.00011734497638826724;0.00025578379938906437;-7.163136046983709E-05;-0.0024110601625579875;-6.3900626571467E-05;3.3939142966232794E-05;3.7367395322007724E-05;-3.99617186985779E-06 0.17986635464753814;-0.5943167934958867;0.8591796057282154;-0.9098936188837109;-0.17249205021037203;0.34044188222478894;-0.12115356235845015;-3.6405240676221244;-0.4912171170462424;0.2949010231855326;0.2611616737682475;-0.018663355086892004 147.11277770996094;294.148378216221;480.2118476304741;609.2554076725077;680.2920844476763;751.3288251067005;591.2443602599604;662.2838134765625;130.08631896972656;115.08655548095703;143.0814666748047;214.11862182617188 12 0.303405304480312 0.0923076923076923 Unknown 79.11602402068965;10.816500709941389;10.816500709941389 AAAEGEMK;ALNDMDK;SGDEWTK _AAAEGEM(Oxidation (M))K_;_ALNDM(Oxidation (M))DK_;_SGDEWTK_ 177 495 7 7 133 639
So when reading the PSMs with psm_utils, you get the following peptidoforms: Peptidoform('ACVINGM[ox]QLK/2') and Peptidoform('AAAEGEM[Oxidation (M)]K/2'), respectively.
If you then try to calculate masses, neither will give the correct result. The first one will actually resolve ox as carboxymethyl, because the last-ditch attempt at resolving in Pyteomics is currently a very permissive Unimod search, while the second one will just raise an exception. In both files there is a Modifications column that gives the modification in the same form: Oxidation (M). It looks like removing the site would leave a name that has a much better chance of producing consistent ProForma. However, I'm not sure how many other kinds of MaxQuant tables are out there.
Hi Lev!
Because the MaxQuant modification labels used to be very non-specific (like the ox example), we opted to carry the labels over as-is into the ProForma string. While this technically produces illegal ProForma entries, they are still parseable. In psm_utils, we then provide the rename_modifications method to convert the MaxQuant labels to legal ProForma entries, like Unimod (also see the relevant section in the MS²Rescore docs).
When MaxQuant changed the labels to be more descriptive (like Oxidation (M)), we considered automatically applying a mapping. Unfortunately, that would still not include user-defined modifications. Nevertheless, as such a mapping would cover most of the use cases, I'm open to reconsidering.
Best, Ralf
Thank you for the response!
It's good to know about the rename_modifications method, although an automatic mapping would definitely make things much easier. Is there a reason not to rely on the Modifications column?
Do you think psm_utils could attempt to match and remove the (site) part to get the modification name? This could silently fall back to the current behavior, so it should do no harm.
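To make the idea concrete, here is a rough sketch of what I mean. It is plain Python and not psm_utils code: strip_site and to_bracketed are hypothetical helper names, and I am assuming the site always appears as a trailing " (…)" suffix on the label, which holds for the two files above but may not hold for every MaxQuant version. Labels without such a suffix (like ox) are returned unchanged, which is the silent fallback. Whether the stripped name then resolves correctly still depends on the ProForma parser.

```python
import re

# Trailing " (SITE)" suffix, e.g. " (M)", " (STY)", " (Protein N-term)".
# Assumption: the site is always the last parenthesised group of the label.
_SITE_SUFFIX = re.compile(r"^(?P<name>.+?)\s+\((?P<site>[^()]+)\)$")


def strip_site(label):
    """Return the label without its site suffix, or unchanged if none is found."""
    match = _SITE_SUFFIX.match(label)
    return match.group("name") if match else label


def to_bracketed(modified_sequence):
    """Convert a MaxQuant modified sequence to a ProForma-style string.

    Parentheses are tracked by depth so that nested labels such as
    "(Oxidation (M))" are read as one modification.
    """
    out = []
    label = []
    depth = 0
    for char in modified_sequence.strip("_"):
        if char == "(":
            if depth > 0:
                label.append(char)  # inner parenthesis belongs to the label
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
            if depth > 0:
                label.append(char)
            else:
                # A label before the first residue is written as N-terminal.
                suffix = "" if out else "-"
                out.append("[" + strip_site("".join(label)) + "]" + suffix)
                label = []
        elif depth > 0:
            label.append(char)
        else:
            out.append(char)
    return "".join(out)


print(to_bracketed("_ACVINGM(ox)QLK_"))
print(to_bracketed("_AAAEGEM(Oxidation (M))K_"))
```

With this, the second example no longer carries the site, while the old-style ox label is left exactly as it is today.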
As far as I see, the Modifications column does not provide more info beyond what's in the modified peptide?
I think the best option would be to automatically apply a mapping that includes the default MaxQuant modification list, with some logging to notify users of this action, and perhaps a warning when not all modifications could be mapped. The logging could be implemented as part of the PSMList-level rename_modifications method. Feel free to open a draft PR if you have the time.
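As a starting point for such a PR, the behavior could look roughly like the sketch below. This is a standalone illustration on ProForma strings, not the psm_utils implementation: DEFAULT_MAPPING, rename_labels and rename_all are hypothetical names, and the mapping is a tiny illustrative subset rather than the full default MaxQuant list.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Illustrative subset only (assumption); a real default mapping would need to
# cover the full MaxQuant modification list, in both old and new label styles.
DEFAULT_MAPPING = {
    "ox": "U:Oxidation",
    "Oxidation (M)": "U:Oxidation",
    "ac": "U:Acetyl",
    "Acetyl (Protein N-term)": "U:Acetyl",
    "Phospho (STY)": "U:Phospho",
}

# One bracketed modification label in a ProForma-style string.
_LABEL = re.compile(r"\[([^\[\]]+)\]")


def rename_labels(proforma, mapping=DEFAULT_MAPPING):
    """Rename known labels; return the new string and the set of unknown labels."""
    unmapped = set()

    def _replace(match):
        label = match.group(1)
        if label in mapping:
            return "[" + mapping[label] + "]"
        unmapped.add(label)  # keep the label as-is (current behavior)
        return match.group(0)

    return _LABEL.sub(_replace, proforma), unmapped


def rename_all(peptidoforms, mapping=DEFAULT_MAPPING):
    """Apply the mapping to many strings, logging what was done."""
    renamed = []
    all_unmapped = set()
    n_changed = 0
    for proforma in peptidoforms:
        new, unmapped = rename_labels(proforma, mapping)
        n_changed += new != proforma
        all_unmapped |= unmapped
        renamed.append(new)
    logger.info("Applied default mapping to %d of %d peptidoforms.",
                n_changed, len(renamed))
    if all_unmapped:
        logger.warning("Could not map modification labels: %s",
                       ", ".join(sorted(all_unmapped)))
    return renamed, all_unmapped


renamed, unmapped = rename_all(
    ["ACVINGM[ox]QLK/2", "AAAEGEM[Oxidation (M)]K/2", "PEPT[MyCustomMod]IDE/2"]
)
print(renamed, unmapped)
```

User-defined modifications stay untouched and are only reported in the warning, so users can still pass their own mapping for those.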
In my first example the Modifications column can be used as a source for the mapping. I have no experience using MaxQuant; perhaps there is a more exhaustive source for name mapping than that?