mdanalysis Consolidate TopologyAttrs with consistent meanings

trafficstars

TopologyAttrs are a bit of a mess right now, and the documentation is not up-to-date. This is a list of some inconsistencies I found.

atomiccharge

The GAMESS parser uses the TopologyAttr atomiccharge for the atomic number.

>>> from MDAnalysis.tests.datafiles import *
>>> gms = mda.Universe(GMS_ASYMOPT)
>>> gms.atoms
<AtomGroup with 6 atoms>
>>> g = gms.atoms[0]
>>> g.atomiccharge
8
>>> g.name
'O'
>>> g.type
'O'
>>> g.element
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lily/anaconda3/envs/mdanalysis/lib/python3.7/site-packages/MDAnalysis/core/groups.py", line 3566, in __getattr__
    cls=self.__class__.__name__, attr=attr))
AttributeError: Atom has no attribute element

This is technically correct, but I think most people would expect the atomiccharge to refer to a partial atomic charge. A more informational name might be atomicnumber.

atomnum

The Desmond DMS parser uses the TopologyAttr atomnum. The Desmond User Guide (section 17.1.1) asserts that 'anum' refers to the atomic number of a particle, but this seems implausible:

>>> from MDAnalysis.tests.datafiles import DMS
>>> dms = mda.Universe(DMS)
>>> dms.atoms.names
array(['N', 'HT1', 'HT2', ..., 'C', 'OT1', 'OT2'], dtype=object)
>>> dms.atoms.atomnums
array([0, 0, 0, ..., 0, 0, 0], dtype=int32)

element vs type

A non-exhaustive search indicates that only the prmtop TOPParser implements the element TopologyAttr.

>>> prm = mda.Universe(PRM)
>>> prm.atoms[:6].names
array(['N', 'H1', 'H2', 'H3', 'CA', 'HA'], dtype=object)
>>> prm.atoms[:6].types
array(['N3', 'H', 'H', 'H', 'CT', 'HP'], dtype=object)
>>> prm.atoms[:6].elements
array(['N', 'H', 'H', 'H', 'C', 'H'], dtype=object)

I looked at formats like GAMESS and XYZ that as far as I know explicitly state element, as well as formats like PDB that either read an element column or guess it from the name. type alternately refers to elements:

>>> pdb = mda.Universe(PDB)
>>> pdb.atoms.types
array(['N', 'H', 'H', ..., 'NA', 'NA', 'NA'], dtype=object)
>>> pdb.atoms.names
array(['N', 'H1', 'H2', ..., 'NA', 'NA', 'NA'], dtype=object)

or to force field / Autodock atom types:

>>> pdbqt = mda.Universe(PDBQT_input)
>>> pdbqt.atoms.types
array(['N', 'HD', 'HD', ..., 'C', 'C', 'OA'], dtype=object)
>>> pdbqt.atoms.names
array(['N', 'HN1', 'HN2', ..., 'CA', 'C', 'O'], dtype=object)
>>> pdbqt.atoms.masses
array([14.007,  0.   ,  0.   , ..., 12.011, 12.011,  0.   ])
>>> pdbqt.atoms.elements
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lily/anaconda3/envs/mdanalysis/lib/python3.7/site-packages/MDAnalysis/core/groups.py", line 2278, in __getattr__
    cls=self.__class__.__name__, attr=attr))
AttributeError: AtomGroup has no attribute elements

My opinion is also that the 0 mass for the Hs and O should have raised a warning.

resid vs resnum

I can't see a difference between resnum and resid.

>>> crd = mda.Universe(CRD)
>>> crd.atoms.resnums
array([  1,   1,   1, ..., 214, 214, 214])
>>> crd.atoms.resids
array([  1,   1,   1, ..., 214, 214, 214])
>>> tpr = mda.Universe(TPR)
>>> tpr.atoms.resnums
array([    0,     0,     0, ..., 11299, 11300, 11301])
>>> tpr.atoms.resids
array([    0,     0,     0, ..., 11299, 11300, 11301])

GROMACS indexes from 1 in user-readable files, as well.

chainIDs

The Desmond parser appears to be the only one to implement the chainID TopologyAttr.

>>> dms.atoms.chainIDs
array(['X', 'X', 'X', ..., 'X', 'X', 'X'], dtype=object)

Comments in the code suggest that all other parsers turn chains into segments.

tempfactors vs bfactors

These parsers have tempfactors but not bfactors:

PDB
PDBx
PDBQT
CRD

MMTFParser has bfactors but not tempfactors.

>>> mmtf.atoms[:6].bfactors
array([ 9.48, 10.88, 10.88, 10.88, 10.88,  9.48])

TPRParser has neither bfactors nor tempfactors, despite the documentation thinking it does.

This impacts on selection language. Selecting same bfactor as *selection* only works for MMTFs because selection language doesn't recognise the tempfactor attribute.

ids

I don't understand the meaning of atom IDs for file formats like XYZ. TPR atom IDs also start from 0, despite GROMACS indexing from 1.

>>> tpr = mda.Universe(TPR)
>>> tpr.atoms.ids
array([    0,     1,     2, ..., 47678, 47679, 47680])

masses

Failing to guess masses gives masses of 0. This is done atom-by-atom, so users may only check the end of an array like the one below, not realising that they do not have appropriate masses elsewhere. This seems like a bad idea.

>>> mol2.atoms.masses
array([ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   , 18.998,  0.   , 35.45 ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.008,  1.008,
        1.008,  1.008,  1.008,  1.008,  1.008,  1.008,  1.008,  1.008,
        1.008,  1.008,  1.008,  1.008,  1.008,  1.008,  1.008,  1.008,
        1.008])

resnames

The GSDParser has numbers as resnames. Should this be the case?

>>> gsd = mda.Universe(GSD)
>>> gsd.atoms.resnames
array([0, 1, 2, ..., 647, 647, 647], dtype=object)
>>> type(gsd.atoms[0].resname)
<class 'int'>

Oct 10 '19 12:10 lilyminium

sigh

Thanks for documenting this. It is really helpful that you're taking a global view.

Oct 19 '19 00:10 orbeckst

>>> gsd = mda.Universe(GSD)
>>> gsd.atoms.resnames
array([0, 1, 2, ..., 647, 647, 647], dtype=object)
>>> gsd.select_atoms("resname 0")
<AtomGroup with 0 atoms>
>>> gsd.select_atoms("resname 2")
<AtomGroup with 0 atoms>

^ A consequence of integer resnames.

Oct 23 '19 15:10 lilyminium

Ew gross. So one option is to make some “reserved” topogyattrs statically typed.

On Wed, Oct 23, 2019 at 16:07, Lily Wang [email protected] wrote:

gsd = mda.Universe(GSD) gsd.atoms.resnames array([0, 1, 2, ..., 647, 647, 647], dtype=object) gsd.select_atoms("resname 0") <AtomGroup with 0 atoms> gsd.select_atoms("resname 2") <AtomGroup with 0 atoms>

^ A consequence of integer resnames.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/2362?email_source=notifications&email_token=ACGSGB4W4MMEQQEIFIH4ZULQQBSCBA5CNFSM4I7MLWZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECBYFAQ#issuecomment-545489538, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB2PUVNK6Y2OZIYAWSLQQBSCBANCNFSM4I7MLWZA .

Oct 27 '19 17:10 richardjgowers

This would be cool to address in 3.0 as it "breaks" things, but it needs fixing (ping #3800)

Aug 31 '22 17:08 richardjgowers

Consolidating attributes looks like a project to be put on the 3.0 roadmap as it is related to

Improve interoperability with chemoinformatics and simulation software.
Improve guessers to enable context-dependent chemical information to be easily inferred.

With PR #3753 almost done, we will have a new guesser infrastructure. We should then clean-up our attributes.

Mar 28 '24 17:03 orbeckst

And by "putting on the roadmap" I mean "We need to figure out how we will make it actually happen." — something for the next dev meeting.

Mar 28 '24 17:03 orbeckst

mdanalysis mdanalysis copied to clipboard

Consolidate TopologyAttrs with consistent meanings

atomiccharge

atomnum

element vs type

resid vs resnum

chainIDs

tempfactors vs bfactors

ids

masses

resnames

mdanalysis
mdanalysis copied to clipboard