mdanalysis
mdanalysis copied to clipboard
Consolidate TopologyAttrs with consistent meanings
TopologyAttrs are a bit of a mess right now, and the documentation is not up-to-date. This is a list of some inconsistencies I found.
atomiccharge
The GAMESS parser uses the TopologyAttr atomiccharge for the atomic number.
>>> from MDAnalysis.tests.datafiles import *
>>> gms = mda.Universe(GMS_ASYMOPT)
>>> gms.atoms
<AtomGroup with 6 atoms>
>>> g = gms.atoms[0]
>>> g.atomiccharge
8
>>> g.name
'O'
>>> g.type
'O'
>>> g.element
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/lily/anaconda3/envs/mdanalysis/lib/python3.7/site-packages/MDAnalysis/core/groups.py", line 3566, in __getattr__
cls=self.__class__.__name__, attr=attr))
AttributeError: Atom has no attribute element
This is technically correct, but I think most people would expect the atomiccharge to refer to a partial atomic charge. A more informational name might be atomicnumber.
atomnum
The Desmond DMS parser uses the TopologyAttr atomnum. The Desmond User Guide (section 17.1.1) asserts that 'anum' refers to the atomic number of a particle, but this seems implausible:
>>> from MDAnalysis.tests.datafiles import DMS
>>> dms = mda.Universe(DMS)
>>> dms.atoms.names
array(['N', 'HT1', 'HT2', ..., 'C', 'OT1', 'OT2'], dtype=object)
>>> dms.atoms.atomnums
array([0, 0, 0, ..., 0, 0, 0], dtype=int32)
element vs type
A non-exhaustive search indicates that only the prmtop TOPParser implements the element TopologyAttr.
>>> prm = mda.Universe(PRM)
>>> prm.atoms[:6].names
array(['N', 'H1', 'H2', 'H3', 'CA', 'HA'], dtype=object)
>>> prm.atoms[:6].types
array(['N3', 'H', 'H', 'H', 'CT', 'HP'], dtype=object)
>>> prm.atoms[:6].elements
array(['N', 'H', 'H', 'H', 'C', 'H'], dtype=object)
I looked at formats like GAMESS and XYZ that as far as I know explicitly state element, as well as formats like PDB that either read an element column or guess it from the name. type alternately refers to elements:
>>> pdb = mda.Universe(PDB)
>>> pdb.atoms.types
array(['N', 'H', 'H', ..., 'NA', 'NA', 'NA'], dtype=object)
>>> pdb.atoms.names
array(['N', 'H1', 'H2', ..., 'NA', 'NA', 'NA'], dtype=object)
or to force field / Autodock atom types:
>>> pdbqt = mda.Universe(PDBQT_input)
>>> pdbqt.atoms.types
array(['N', 'HD', 'HD', ..., 'C', 'C', 'OA'], dtype=object)
>>> pdbqt.atoms.names
array(['N', 'HN1', 'HN2', ..., 'CA', 'C', 'O'], dtype=object)
>>> pdbqt.atoms.masses
array([14.007, 0. , 0. , ..., 12.011, 12.011, 0. ])
>>> pdbqt.atoms.elements
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/lily/anaconda3/envs/mdanalysis/lib/python3.7/site-packages/MDAnalysis/core/groups.py", line 2278, in __getattr__
cls=self.__class__.__name__, attr=attr))
AttributeError: AtomGroup has no attribute elements
My opinion is also that the 0 mass for the Hs and O should have raised a warning.
resid vs resnum
I can't see a difference between resnum and resid.
>>> crd = mda.Universe(CRD)
>>> crd.atoms.resnums
array([ 1, 1, 1, ..., 214, 214, 214])
>>> crd.atoms.resids
array([ 1, 1, 1, ..., 214, 214, 214])
>>> tpr = mda.Universe(TPR)
>>> tpr.atoms.resnums
array([ 0, 0, 0, ..., 11299, 11300, 11301])
>>> tpr.atoms.resids
array([ 0, 0, 0, ..., 11299, 11300, 11301])
GROMACS indexes from 1 in user-readable files, as well.
chainIDs
The Desmond parser appears to be the only one to implement the chainID TopologyAttr.
>>> dms.atoms.chainIDs
array(['X', 'X', 'X', ..., 'X', 'X', 'X'], dtype=object)
Comments in the code suggest that all other parsers turn chains into segments.
tempfactors vs bfactors
These parsers have tempfactors but not bfactors:
- PDB
- PDBx
- PDBQT
- CRD
MMTFParser has bfactors but not tempfactors.
>>> mmtf.atoms[:6].bfactors
array([ 9.48, 10.88, 10.88, 10.88, 10.88, 9.48])
TPRParser has neither bfactors nor tempfactors, despite the documentation thinking it does.
This impacts on selection language. Selecting same bfactor as *selection* only works for MMTFs because selection language doesn't recognise the tempfactor attribute.
ids
I don't understand the meaning of atom IDs for file formats like XYZ. TPR atom IDs also start from 0, despite GROMACS indexing from 1.
>>> tpr = mda.Universe(TPR)
>>> tpr.atoms.ids
array([ 0, 1, 2, ..., 47678, 47679, 47680])
masses
Failing to guess masses gives masses of 0. This is done atom-by-atom, so users may only check the end of an array like the one below, not realising that they do not have appropriate masses elsewhere. This seems like a bad idea.
>>> mol2.atoms.masses
array([ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 18.998, 0. , 35.45 , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 1.008, 1.008,
1.008, 1.008, 1.008, 1.008, 1.008, 1.008, 1.008, 1.008,
1.008, 1.008, 1.008, 1.008, 1.008, 1.008, 1.008, 1.008,
1.008])
resnames
The GSDParser has numbers as resnames. Should this be the case?
>>> gsd = mda.Universe(GSD)
>>> gsd.atoms.resnames
array([0, 1, 2, ..., 647, 647, 647], dtype=object)
>>> type(gsd.atoms[0].resname)
<class 'int'>
sigh
Thanks for documenting this. It is really helpful that you're taking a global view.
>>> gsd = mda.Universe(GSD)
>>> gsd.atoms.resnames
array([0, 1, 2, ..., 647, 647, 647], dtype=object)
>>> gsd.select_atoms("resname 0")
<AtomGroup with 0 atoms>
>>> gsd.select_atoms("resname 2")
<AtomGroup with 0 atoms>
^ A consequence of integer resnames.
Ew gross. So one option is to make some “reserved” topogyattrs statically typed.
On Wed, Oct 23, 2019 at 16:07, Lily Wang [email protected] wrote:
gsd = mda.Universe(GSD) gsd.atoms.resnames array([0, 1, 2, ..., 647, 647, 647], dtype=object) gsd.select_atoms("resname 0") <AtomGroup with 0 atoms> gsd.select_atoms("resname 2") <AtomGroup with 0 atoms>
^ A consequence of integer resnames.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/2362?email_source=notifications&email_token=ACGSGB4W4MMEQQEIFIH4ZULQQBSCBA5CNFSM4I7MLWZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECBYFAQ#issuecomment-545489538, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB2PUVNK6Y2OZIYAWSLQQBSCBANCNFSM4I7MLWZA .
This would be cool to address in 3.0 as it "breaks" things, but it needs fixing (ping #3800)
Consolidating attributes looks like a project to be put on the 3.0 roadmap as it is related to
- Improve interoperability with chemoinformatics and simulation software.
- Improve guessers to enable context-dependent chemical information to be easily inferred.
With PR #3753 almost done, we will have a new guesser infrastructure. We should then clean-up our attributes.
And by "putting on the roadmap" I mean "We need to figure out how we will make it actually happen." — something for the next dev meeting.