mdtraj GLH residue code

Hi, I think the one-letter code for GLH is wrong in mdtraj.core.residue_names._AMINO_ACID_CODES: https://github.com/mdtraj/mdtraj/blob/36b4cee61d936d9ecee875fbdde37120b59a4d51/mdtraj/core/residue_names.py#L107

It should be a protonated GLU (E), not a protonated GLN (Q).

Feb 27 '24 13:02 thempel

Is there a canonical reference for these codes? It's all slightly confusing as a non-biophysicist.

Feb 27 '24 13:02 mattwthompson

I agree that these naming conventions are confusing and can differ between communities. I'd probably stay close to the definitions of amino acid residues given in the major force fields. For example, GLH is defined in the Amber force field (compare this line). Unfortunately, I don't have a good canonical reference list or dictionary.

Feb 27 '24 17:02 thempel

The three letter codes are defined by the PDB. The names used by Amber are nonstandard and conflict with the PDB definitions.

Feb 27 '24 17:02 peastman

True. It seems, and maybe @peastman can confirm, that the mdtraj one-letter code definitions are actually taken from the PDB chemical component dictionary.

About the current case: The PDB's definition of GLH gives a one-letter code Q, GLN as parent comp id, and name "N-5-CYCLOHEXYL-N-5-[(CYCLOHEXYLAMINO)CARBONYL]GLUTAMINE". So this isn't a different protonation state of a standard amino acid but a more complex chemical modification. In my experience, if you open a random MD simulation, the chances are pretty low that a residue named GLH actually refers to this, and very high that it's a protonated GLU from an Amber-based simulation.
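To make the conflict concrete, here is a minimal sketch of a convention-aware lookup. The table names and the `one_letter` helper are hypothetical and not part of mdtraj; only the two GLH entries reflect what is said in this thread (Q under the PDB chemical component dictionary, E when GLH means protonated GLU as in Amber).

```python
# Hypothetical lookup tables for illustration; these are NOT mdtraj's tables.
# Only the GLH entries are taken from the discussion above.
PDB_CCD_CODES = {"GLU": "E", "GLN": "Q", "GLH": "Q"}  # GLH: modified glutamine
AMBER_CODES = {"GLU": "E", "GLN": "Q", "GLH": "E"}    # GLH: protonated GLU

TABLES = {"pdb": PDB_CCD_CODES, "amber": AMBER_CODES}


def one_letter(resname, convention):
    """Return the one-letter code for resname under the given convention."""
    return TABLES[convention][resname.upper()]


print(one_letter("GLH", "pdb"))
print(one_letter("GLH", "amber"))
```

The same three-letter name resolves to different letters depending on which convention the input file follows, so the caller has to know where the file came from.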

This means that the output of traj.topology.to_fasta() is very likely wrong if there are protonated residues with Amber names. Adding to the confusion, some of the names used by Amber are not listed, while others are mapped to '' as their one-letter code, silently producing sequences that are shorter than the number of amino acids in the protein. Names that are not listed are also classified as non-protein in selections.
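One way to keep the shortening from going unnoticed is to check residue names before building a sequence. The sketch below is plain Python and does not call mdtraj; `TOY_CODES`, `to_sequence`, and the "X" placeholder are assumptions for illustration, and the toy table's values are not copied from mdtraj's real table.

```python
# Toy code table for illustration only; values are NOT copied from mdtraj.
# "HID" is deliberately given an empty code to mimic the '' case described above.
TOY_CODES = {"ALA": "A", "GLU": "E", "LYS": "K", "HID": ""}


def to_sequence(resnames, codes, placeholder="X"):
    """Build a one-letter sequence that always has len(resnames) characters.

    Residues whose name is missing from `codes` or maps to '' are replaced
    by `placeholder` and reported as (index, name) pairs.
    """
    letters = []
    unmapped = []
    for index, name in enumerate(resnames):
        code = codes.get(name.upper(), "")
        if not code:
            unmapped.append((index, name))
            code = placeholder
        letters.append(code)
    return "".join(letters), unmapped


seq, unmapped = to_sequence(["ALA", "HID", "GLU", "CYX"], TOY_CODES)
print(seq, unmapped)
```

Returning the unmapped residues alongside the sequence lets the caller decide whether to rename them, warn, or abort, instead of getting a sequence with residues silently dropped.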

Feb 28 '24 16:02 thempel

> the mdtraj one-letter code definitions are actually taken from the PDB chemical component dictionary.

Correct - this allows us to follow a fairly robust standard regardless of the force field used, and to stay compatible with experimental structural biology standards. Especially since, as stated above by Peter:

> The names used by Amber are nonstandard and conflict with the PDB definitions.

It looks like we do support the protonated form in residue_names.py already, so if the residue is specified as GLH in the topology during loading, it'll probably be fine? I haven't tested it with Amber engine files.

Jun 02 '24 18:06 sukritsingh
