
Peptidoforms parsed from MaxQuant output are not always valid

Open levitsky opened this issue 1 month ago • 4 comments

Hi!

I am trying to parse a few different MaxQuant output files, and the modified sequences extracted from them do not end up as valid ProForma. Here is one row from one file:

Raw file	Scan number	Scan index	Sequence	Length	Missed cleavages	Modifications	Modified sequence	Oxidation (M) Probabilities	Oxidation (M) Score Diffs	Oxidation (M)	Proteins	Charge	Fragmentation	Mass analyzer	Type	Scan event number
01974c_BA1-TUM_missing_first_1_01_01-3xHCD-1h-R4	18859	16929	ACVINGMQLK	10	0	Oxidation (M)	_ACVINGM(ox)QLK_	ACVINGM(1)QLK	ACVINGM(101.9)QLK	1	TUM_missing_first_1	2	HCD	FTMS	MULTI-MSMS	31	0	575.29138	1148.5682	0.17725	1.7827292	24.19	0.00036581	101.9	100.71	101.9	1	1	0.812871	0.009392453	0.1333722	18828	6096732	0.426561	-2	0.06135368	y1;y2;y3;y4;y5;y6;y7;y8;y9;y1-NH3;y6-NH3;y7-NH3;y8-NH3;y8(2+);a2;b2;b3;b4;b5	71809.2;29973.1;24011.2;32854.1;190928.3;556904.4;805166.8;685423.2;105967.8;11586.8;26761.8;29306.5;32482.7;10282.6;114158.7;652840.1;435104.2;53264.8;28631	-0.0004923382;-0.0005195443;-0.003007707;0.0006837579;0.0004915272;-0.0003559902;-0.0005144462;0.0002259294;-0.00348233;0.0003633455;-0.00334429;-0.002593298;-0.006689147;0.003100681;4.57087E-05;-6.530397E-05;-0.0001010637;-0.0003278859;-0.002078173	-3.34666;-1.996731;-7.746661;1.277359;0.8298453;-0.5039816;-0.6278023;0.2459745;-3.228739;2.79312;-4.851494;-3.231865;-7.420119;6.744212;0.2239743;-0.2813915;-0.305196;-0.7381029;-3.722506	147.113296508789;260.197387695313;388.258453369141;535.290161132813;592.311817087127;706.355592051727;819.439814488134;918.507488028666;1078.54184448957;130.085891723633;689.33203125;802.415344238281;901.487854003906;459.75439453125;204.080078125;232.075103759766;331.143553435689;444.227844238281;558.272521972656	19	0.5166685	0.2289157	None	Unknown		101.8962;1.189599;0.2601814	ACVINGMQLK;LKDSEGSGTAGK;DAHKSEVAHR	_ACVINGM(ox)QLK_;_LKDSEGSGTAGK_;_DAHKSEVAHR_	66	8	7	7	15	1

and now another file:

Raw file	Scan number	Scan index	Sequence	Length	Missed cleavages	Modifications	Modified sequence	Oxidation (M) Probabilities	Phospho (STY) Probabilities	Oxidation (M) Score diffs	Phospho (STY) Score diffs	Acetyl (Protein N-term)	Oxidation (M)
OXPAL230121_44	14613	7897	AAAEGEMK	8	0	Oxidation (M)	_AAAEGEM(Oxidation (M))K_	AAAEGEM(1)K		AAAEGEM(79)K		0	1	0	P0A9B2	gapA	Glyceraldehyde-3-phosphate dehydrogenase A	2	HCD	FTMS	MULTI-MSMS	1	0.0	411.68674	821.35892	0.31799	0.00013091	-0.57361634	25.967	0.0047701	79.116	68.3	79.116	1.0	1	0	0	0	14612	8211736.5	0.0651129635157789	-8	0.0703334808349609		y1;y2;y4;y5;y6;y7;y5-H2O;y6-H2O;y1-NH3;a2;b2;b3	162904.734375;75995.5625;159223.859375;106600.2265625;201834.5625;48688.8359375;47573.99609375;7619.5693359375;65347.02734375;367253.65625;314543.28125;40487.2734375	2.646063904876428E-05;-0.0001748173209534798;0.00041258822591316857;-0.0005543576077116086;-0.00011734497638826724;0.00025578379938906437;-7.163136046983709E-05;-0.0024110601625579875;-6.3900626571467E-05;3.3939142966232794E-05;3.7367395322007724E-05;-3.99617186985779E-06	0.17986635464753814;-0.5943167934958867;0.8591796057282154;-0.9098936188837109;-0.17249205021037203;0.34044188222478894;-0.12115356235845015;-3.6405240676221244;-0.4912171170462424;0.2949010231855326;0.2611616737682475;-0.018663355086892004	147.11277770996094;294.148378216221;480.2118476304741;609.2554076725077;680.2920844476763;751.3288251067005;591.2443602599604;662.2838134765625;130.08631896972656;115.08655548095703;143.0814666748047;214.11862182617188	12	0.303405304480312	0.0923076923076923		Unknown		79.11602402068965;10.816500709941389;10.816500709941389	AAAEGEMK;ALNDMDK;SGDEWTK	_AAAEGEM(Oxidation (M))K_;_ALNDM(Oxidation (M))DK_;_SGDEWTK_				177	495	7	7	133	639

So when reading the PSMs with psm_utils, you get the following peptidoforms: Peptidoform('ACVINGM[ox]QLK/2') and Peptidoform('AAAEGEM[Oxidation (M)]K/2'), respectively.

If you then try to calculate masses, neither will give the correct result. The first one will actually resolve ox as carboxymethyl, because the last-ditch attempt at resolving in Pyteomics is currently a very permissive Unimod search, while the other will just raise an exception. In both files there is a Modifications column that gives the modification in the same form: Oxidation (M). It looks like if we remove the site, we can use the name and have a much better chance of getting consistent ProForma. However, I'm not sure how many other kinds of MaxQuant tables are out there.

levitsky, Nov 06 '25 15:11

Hi Lev!

Because the MaxQuant modification labels used to be very non-specific (like the ox example), we opted to carry the labels over as-is into the ProForma string. While this technically produces illegal ProForma entries, they are still parseable. In psm_utils, we then provide the rename_modifications method to convert the MaxQuant labels to legal ProForma entries, such as Unimod names (also see the relevant section in the MS²Rescore docs).

When MaxQuant changed the labels to be more descriptive (like Oxidation (M)), we considered automatically applying a mapping. Unfortunately, that would still not cover user-defined modifications. Nevertheless, as such a mapping would cover most use cases, I'm open to reconsidering.

Best, Ralf

RalfG, Nov 06 '25 18:11

Thank you for the response!

It's good to know about the rename_modifications method, although an automatic mapping would definitely make things much easier. Is there a reason not to rely on the Modifications column?

Do you think psm_utils could attempt to match and remove the (site) part to get the modification name? This could silently fall back to the current behavior when nothing matches, so it should do no harm?
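
A minimal sketch of the site-stripping idea, in plain Python and independent of psm_utils. The helper name `strip_site` and the regular expression are assumptions made for illustration, not part of any library; labels without a trailing parenthesized site are returned unchanged, which mirrors the silent fallback described above.

```python
import re

# Hypothetical pattern: a trailing parenthesized site such as " (M)" or
# " (Protein N-term)", with no nested parentheses inside it.
_SITE_SUFFIX = re.compile(r"\s*\([^()]*\)$")


def strip_site(label: str) -> str:
    """Return the label without its trailing '(site)' part.

    Labels that have no such suffix (for example the old-style 'ox')
    are returned unchanged.
    """
    stripped = _SITE_SUFFIX.sub("", label)
    # Guard against labels that consist only of a parenthesized part.
    return stripped if stripped else label


if __name__ == "__main__":
    for label in ["Oxidation (M)", "Acetyl (Protein N-term)", "ox"]:
        print(f"{label!r} -> {strip_site(label)!r}")
```

The stripped name still has to resolve in whatever controlled vocabulary is used downstream, and old-style labels such as ox pass through untouched, so this would complement rather than replace an explicit mapping.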

levitsky, Nov 07 '25 11:11

As far as I can see, the Modifications column does not provide any information beyond what's already in the modified sequence?

I think the best option would be to automatically apply a mapping that includes the default MaxQuant modification list, with some logging to notify users of this action, and perhaps a warning when not all modifications could be mapped. The logging could be implemented as part of the PSMList-level rename_modifications method. Feel free to open a draft PR if you have the time.
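
The behavior proposed here (apply a mapping, log what was renamed, warn about what was not) could be sketched roughly as follows. This is plain Python operating on ProForma-like strings, not the psm_utils implementation: `rename_labels` and `EXAMPLE_MAPPING` are hypothetical names, and the mapping is a small illustrative subset, not the actual MaxQuant default modification list.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Illustrative subset only (assumption): old- and new-style MaxQuant labels
# mapped to Unimod names using the ProForma "U:" prefix.
EXAMPLE_MAPPING = {
    "ox": "U:Oxidation",
    "Oxidation (M)": "U:Oxidation",
    "ac": "U:Acetyl",
    "Acetyl (Protein N-term)": "U:Acetyl",
    "Phospho (STY)": "U:Phospho",
}

# A bracketed modification tag, e.g. "[ox]" or "[Oxidation (M)]".
_TAG = re.compile(r"\[([^\[\]]+)\]")


def rename_labels(proforma_strings, mapping):
    """Rename bracketed labels according to `mapping`.

    Returns the renamed strings and the set of labels that had no mapping.
    Unmapped labels are left in place, so the current behavior is preserved.
    """
    applied = set()
    unmapped = set()

    def _replace(match):
        label = match.group(1)
        if label in mapping:
            applied.add(label)
            return f"[{mapping[label]}]"
        unmapped.add(label)
        return match.group(0)

    renamed = [_TAG.sub(_replace, s) for s in proforma_strings]
    if applied:
        logger.info("Renamed modification labels: %s", sorted(applied))
    if unmapped:
        logger.warning("Could not map modification labels: %s", sorted(unmapped))
    return renamed, unmapped


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    peptidoforms = ["ACVINGM[ox]QLK/2", "AAAEGEM[Oxidation (M)]K/2"]
    print(rename_labels(peptidoforms, EXAMPLE_MAPPING)[0])
```

Returning the unmapped labels alongside the logging lets a caller decide whether a user-defined modification should be a hard error or just a warning.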

RalfG, Nov 11 '25 17:11

In my first example, the Modifications column can be used as a source for the mapping, since it gives the full name (Oxidation (M)) where the modified sequence only has ox. I have no experience using MaxQuant; perhaps there is a more exhaustive source for the name mapping than that?

levitsky, Nov 12 '25 15:11