mdtraj
GLH residue code
Hi, I think the one-letter code of GLH is wrong in mdtraj.core.residue_names._AMINO_ACID_CODES:
https://github.com/mdtraj/mdtraj/blob/36b4cee61d936d9ecee875fbdde37120b59a4d51/mdtraj/core/residue_names.py#L107
It should be a protonated GLU (E), not a protonated GLN (Q).
Is there a canonical reference for these codes? It's all slightly confusing to me as a non-biophysicist.
I agree that these naming conventions are confusing and can differ between communities. I'd probably stay close to the definitions of amino acid residues given in the major force fields. E.g. GLH is defined in the Amber force field (compare this line). Unfortunately, I don't have a good canonical reference list or dictionary.
The three letter codes are defined by the PDB. The names used by Amber are nonstandard and conflict with the PDB definitions.
True. It seems, and maybe @peastman can confirm, that the mdtraj one-letter code definitions are actually taken from the PDB chemical component dictionary.
About the current case: The PDB's definition of GLH gives a one-letter code Q, GLN as parent comp id, and name "N-5-CYCLOHEXYL-N-5-[(CYCLOHEXYLAMINO)CARBONYL]GLUTAMINE". So this isn't a different protonation state of a standard amino acid but a more complex chemical modification. In my experience, if you open a random MD simulation, the chances are pretty low that a residue named GLH actually refers to this, and very high that it's a protonated GLU from an Amber-based simulation.
This means that the output of traj.topology.to_fasta() is very likely wrong if there are protonated residues with Amber names. Adding to the confusion, some of the names used by Amber are not listed, but are mapped to '' one-letter codes, silently producing sequences that are shorter than the number of amino acids in the protein. Not being listed also means they are classified as not protein in selections.
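To illustrate the point, here is a minimal sketch of an Amber-aware conversion that works on plain residue-name strings. It does not touch mdtraj internals: `AMBER_PARENT_CODES`, `STANDARD_CODES` and `amber_aware_fasta` are hypothetical names made up for this example, and the override table only covers the Amber names I am sure about. Unknown names become a placeholder instead of '', so the sequence length always matches the number of residues.

```python
# Hypothetical override table (not part of mdtraj): Amber-specific residue
# names mapped to the one-letter code of their parent amino acid.
AMBER_PARENT_CODES = {
    "ASH": "D",  # protonated ASP
    "GLH": "E",  # protonated GLU
    "HID": "H", "HIE": "H", "HIP": "H",  # histidine protonation states
    "CYX": "C",  # disulfide-bonded CYS
    "CYM": "C",  # deprotonated CYS
    "LYN": "K",  # neutral LYS
}

# The 20 standard amino acids and their one-letter codes.
STANDARD_CODES = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def amber_aware_fasta(residue_names, unknown="X"):
    """Build a one-letter sequence, preferring the Amber interpretation.

    Amber overrides are checked first, so GLH becomes E rather than Q.
    Names found in neither table map to `unknown`, never to ''.
    """
    codes = []
    for name in residue_names:
        key = name.strip().upper()
        code = AMBER_PARENT_CODES.get(key) or STANDARD_CODES.get(key) or unknown
        codes.append(code)
    return "".join(codes)
```

With a loaded trajectory, the input could be something like `[r.name for r in traj.topology.residues]`, filtered to the protein chain first, since waters and ions would otherwise show up as the placeholder.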
the mdtraj one-letter code definitions are actually taken from the PDB chemical component dictionary.
Correct. This lets us follow a fairly robust standard regardless of the force field used and stay compatible with experimental structural biology standards. Especially since, as stated above by Peter:
The names used by Amber are nonstandard and conflict with the PDB definitions.
It looks like we do support the protonated form in residue_names.py already, so if the residue is specified as GLH in the topology during loading, it'll probably be fine? I haven't tested it before with amber engine files.