
pepXML modifications are offset by one

Open nnalpas opened this issue 1 year ago • 0 comments

Hi, I think there is an issue with how the peptidoform is parsed from pepXML files.

Take this peptide hit as an example:

<search_hit peptide="AHTMVHDQVSR" massdiff="-6.103515625E-4" calc_neutral_pep_mass="1295.604" peptide_next_aa="F" num_missed_cleavages="0" num_tol_term="2" protein_descr="gene=gltA;locus_tag=19A2747_02138;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P14165;product=Citrate synthase" num_tot_proteins="1" tot_num_ions="20" hit_rank="1" num_matched_ions="6" protein="19A2747_02138_gene" peptide_prev_aa="R" is_rejected="0">
  <modification_info modified_peptide="AHTM[147.0354]VHDQVSR">
    <mod_aminoacid_mass mass="147.0354" position="4"/>
  </modification_info>
  <search_score name="hyperscore" value="15.15"/>
  <search_score name="nextscore" value="0.0"/>
  <search_score name="expect" value="3.868121e-04"/>
</search_hit>

For this hit, psm_utils.io.read_file returns:

AHTMV[+147.0354]HDQVSR/3

The modification reported at position 4 (the oxidized M) ends up on position 5 (the V).

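This might be due to the modification parsing in the function "_parse_peptidoform", specifically this line:

sequence = [(aa, modifications_dict[i] or None) for i, aa in enumerate(peptide)]

The pepXML position attribute appears to be 1-based (position="4" is the M here), while enumerate starts at 0, so each modification lands one residue too far to the right. I could be wrong, but I think the line should be:

sequence = [(aa, modifications_dict[i+1] or None) for i, aa in enumerate(peptide)]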
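To illustrate what I mean, here is a small standalone sketch that reproduces the shift without psm_utils. The helper names (attach_modifications, render) and the one_based flag are made up for this example and are not part of the library; the sketch only assumes that pepXML positions are 1-based, as the search hit above suggests.

```python
from collections import defaultdict


def attach_modifications(peptide, modifications, one_based=True):
    """Pair each residue with the modification masses reported at its position.

    Hypothetical helper for illustration only, not psm_utils code.
    """
    mods_by_position = defaultdict(list)
    for mod in modifications:
        mods_by_position[mod["position"]].append(mod["mass"])

    # enumerate() counts from 0, so a 1-based position needs an offset of 1.
    offset = 1 if one_based else 0
    return [
        (aa, mods_by_position.get(i + offset))  # None when nothing is reported
        for i, aa in enumerate(peptide)
    ]


def render(sequence):
    """Write the residues back out with each mass in brackets after its residue."""
    return "".join(
        aa + "".join(f"[{mass}]" for mass in (mods or []))
        for aa, mods in sequence
    )


peptide = "AHTMVHDQVSR"
modifications = [{"position": 4, "mass": 147.0354}]  # values from the search hit

# Treating the position as 0-based reproduces the shifted result.
print(render(attach_modifications(peptide, modifications, one_based=False)))

# Treating it as 1-based puts the mass back on the methionine.
print(render(attach_modifications(peptide, modifications, one_based=True)))
```

With the 1-based offset, the rendered string matches the modified_peptide attribute in the pepXML hit.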

I hope this helps. Thanks,

nnalpas avatar Oct 03 '24 14:10 nnalpas