mzLib
Certain modifications are kept in features other than "modified residue"
Two cases I've found (so far)
- The ptmlist.txt file has mods with PP set to "Protein core", which isn't a case we currently look for
- A lipidation mod in the UniProt .xml is feature type "lipid moiety-binding region" instead of "modified residue"
Nice find, especially on the lipidation thing!
We could just change `if (FeatureType == "modified residue")`
to `if (FeatureType == "modified residue" || FeatureType == "lipid moiety-binding region")`
in `ProteinXmlEntry` for a quick fix. These would be labeled "modified residue" during writing, but that seems okay.
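To make the quick fix concrete, here is a rough sketch of the same idea in Python rather than the C# that mzLib uses. It is not the `ProteinXmlEntry` code: the function name, the `MOD_FEATURE_TYPES` set, and the sample entry (accession `X00001`) are all made up for illustration. It assumes the usual UniProt XML layout, where each `feature` element carries `type` and `description` attributes and a nested `location/position`.

```python
import xml.etree.ElementTree as ET

# Feature types treated as carrying a modification.
# Assumption: this set is illustrative and would grow as the thread decides.
MOD_FEATURE_TYPES = {"modified residue", "lipid moiety-binding region"}

# Namespace used by UniProt XML files.
NS = {"up": "http://uniprot.org/uniprot"}

# Hand-written sample for illustration, not a real UniProt entry.
SAMPLE_XML = """
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <feature type="modified residue" description="Phosphoserine">
      <location><position position="5"/></location>
    </feature>
    <feature type="lipid moiety-binding region" description="S-palmitoyl cysteine">
      <location><position position="3"/></location>
    </feature>
    <feature type="chain" description="Example chain">
      <location><begin position="1"/><end position="10"/></location>
    </feature>
  </entry>
</uniprot>
"""


def modification_features(xml_text):
    """Return (feature type, description, position) for each accepted feature."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for feature in root.iterfind(".//up:feature", NS):
        if feature.get("type") not in MOD_FEATURE_TYPES:
            continue  # e.g. "chain" is skipped
        pos = feature.find("up:location/up:position", NS)
        position = int(pos.get("position")) if pos is not None else None
        found.append((feature.get("type"), feature.get("description"), position))
    return found


for row in modification_features(SAMPLE_XML):
    print(row)
```

Keeping the accepted types in one set means adding another feature type later is a one-line change instead of another `||` clause.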
Can "Protein core" be any amino acid in the protein, even on the termini? I can't find a definition at UniProt. Google says that the protein core is simply the solvent-inaccessible part. I don't think that necessarily precludes termini (though possibly it does, since they're charged?)
There are modifications that only occur in the protein core? That's kind of fascinating if true.
I guess it makes sense that some of these reactive side chains would only be stable when protected in the core of the protein, like 4-thiazolecarboxylic acid and 2,3-didehydroalanine.
If we were to use one of these mods in a GPTMD scenario, what would the rules be?
I think we'd treat them as any other mod, right? Being in the protein core is (to me) only relevant in a secondary+ structural sense. I guess we're assuming any proteins run through our software are 1) denatured and then digested (BU), 2) intact-mass (irrelevant where the mod is), 3) denatured and then shot on the mass spec (TD), or 4) not denatured and shot on the mass spec (native). Doesn't this only apply for native MS?
I did an analysis of "feature type" in human canonical uniprot.xml and got the following table:
Lipid-binding is pretty far down the list (but fine). I guess we need to look at these and decide what to do. I suspect there are other things we'd like to have.
This command is handy for examining a large file:
grep "feature type=" xml.xml > ft.txt
It dumps every line containing "feature type=" to a new text file, ft.txt.
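The grep output still has to be counted to get a per-type tally. A minimal Python sketch of that step is below. The function names are hypothetical, and it assumes each `<feature ...>` opening tag sits on a single line, which is the same assumption the grep approach makes.

```python
import re
from collections import Counter

# Captures the value of the type attribute on a <feature ...> opening tag.
FEATURE_TYPE = re.compile(r'<feature\b[^>]*\btype="([^"]*)"')


def count_feature_types(lines):
    """Tally feature types over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(FEATURE_TYPE.findall(line))
    return counts


def count_feature_types_in_file(path):
    """Same tally, streaming a (possibly large) XML file line by line."""
    with open(path, encoding="utf-8") as handle:
        return count_feature_types(handle)


# Hand-written lines for illustration only.
sample_lines = [
    '<feature type="modified residue" description="Phosphoserine">',
    '<feature type="chain" description="Example chain">',
    '<feature type="modified residue" description="N-acetylalanine">',
    "<location>",
]

for feature_type, n in count_feature_types(sample_lines).most_common():
    print(n, feature_type)
```

Calling `count_feature_types_in_file("xml.xml")` would run the same tally on the file from the grep example without loading it all into memory.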
Nice analysis. Thanks, Shortreed.
We could also take "metal ion-binding site" into account.
EDIT: an example:
Are the "non-standard amino acid" features all selenocysteine? Are these symbols included in the sequence?
EDIT: yes, it looks like they're all selenocysteine for Homo sapiens. There's also this interesting preceding site feature in one instance:
EDIT: yes, it looks like they're also in the sequences.
no idea
Is "non-terminal residue" from circular peptides? And what the heck is the singular "non-consecutive residues" feature?
This non-consecutive definition is pretty vague: https://www.uniprot.org/help/non_cons
A recent conversation with tal fellers makes me think that if we read a "modified residue" from uniprot.xml and we don't have a matching modification, we should provide an error message (rather than skipping it automatically). For example: MM cannot interpret modification '(2S)-4-hydroxyleucine' in protein P12345 from human_protein_canonical.xml
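A minimal sketch of that check, in Python rather than mzLib's C#, might look like the following. Everything here is hypothetical: the function name, the `known_mods` set, and the sample annotations are made up, and only the message wording is taken from the example above. It collects messages instead of raising, so whether they surface as errors or warnings is left to the caller.

```python
def unmatched_modification_messages(annotations, known_mods, source_file):
    """Build one message per annotated modification with no known match.

    annotations: iterable of (protein accession, modification description)
    known_mods:  set of modification descriptions the software understands
    """
    messages = []
    for accession, description in annotations:
        if description not in known_mods:
            messages.append(
                f"MM cannot interpret modification '{description}' "
                f"in protein {accession} from {source_file}"
            )
    return messages


# Illustrative inputs, not real database contents.
known = {"Phosphoserine"}
annotated = [
    ("P12345", "Phosphoserine"),
    ("P12345", "(2S)-4-hydroxyleucine"),
]

for message in unmatched_modification_messages(
    annotated, known, "human_protein_canonical.xml"
):
    print(message)
```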
An error as opposed to a warning?
I moved that discussion to another issue https://github.com/smith-chem-wisc/mzLib/issues/417, since it's kind of separate from this one.