mdanalysis
mdanalysis copied to clipboard
guess_masses returning 0 makes no sense
Expected behavior
The center_of_mass method is only accessible when information about masses is present.
Actual behavior
Due to the fact that the guesser returns a mass of 0 for atoms it does not recognise, some parsers will yield topologies with dangerous mass information.
Code to reproduce the behavior
An example with the GROParser, but the principle applies to all parsers with mass guessing.
$ cat some_weird_atoms.gro
Test system with weird atoms
3
1SURF Al 1 2.884 1.394 0.508 -0.1400 0.2476 0.1342
1SURF Si 2 0.086 2.245 0.814 -0.2134 0.2260 0.1815
1SURF XY 3 0.068 1.569 0.811 0.2462 0.7906 0.0157
2.89202 3.01276 2.54300
$ cat only_weird_atoms.gro
Test system with weird atoms
3
1SURF RW 1 2.884 1.394 0.508 -0.1400 0.2476 0.1342
1SURF RW 2 0.086 2.245 0.814 -0.2134 0.2260 0.1815
1SURF RW 3 0.068 1.569 0.811 0.2462 0.7906 0.0157
2.89202 3.01276 2.54300
>>> import MDAnalysis as mda
>>> u = mda.Universe('some_weird_atoms.gro')
/Users/henrik/mdanalysis/package/MDAnalysis/topology/guessers.py:146: UserWarning: Failed to guess the mass for the following atom types: AL
warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
/Users/henrik/mdanalysis/package/MDAnalysis/topology/guessers.py:146: UserWarning: Failed to guess the mass for the following atom types: SI
warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
/Users/henrik/mdanalysis/package/MDAnalysis/topology/guessers.py:146: UserWarning: Failed to guess the mass for the following atom types: X
warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
>>> u.atoms.masses
array([0., 0., 0.])
>>> u = mda.Universe('some_weird_atoms.gro')
>>> u.atoms.masses
array([1.008, 1.008, 0. ])
>>> u.atoms.center_of_mass()
array([14.85000008, 18.19499969, 6.61000013])
>>> u = mda.Universe('only_weird_atoms.gro')
/Users/henrik/mdanalysis/package/MDAnalysis/topology/guessers.py:146: UserWarning: Failed to guess the mass for the following atom types: R
warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
>>> u.atoms.masses
array([0., 0., 0.])
>>> u.atoms.center_of_mass()
/Users/henrik/mdanalysis/package/MDAnalysis/core/groups.py:926: RuntimeWarning: invalid value encountered in true_divide
return (coords * weights[:, None]).sum(axis=0) / weights.sum()
array([nan, nan, nan])
The first COM is just wrong, the second one is actually better, because the user gets a warning of sorts that they are working with weird data.
Why would the guesser return 0 for unknown atoms? I believe it should return an empty value so the parser can disregard the mass attribute altogether, instead of importing "almost right" masses.
Keep in mind that in some instances users don't see the guesser warnings and assume everything worked correctly.
TLDR: the mass guesser should not return 0 for unknown atom types.
Current version of MDAnalysis
$ python3 -c "import MDAnalysis as mda; print(mda.__version__)"
2.0.0-dev0
$ python -V
Python 2.7.16
$ sw_vers
ProductName: macOS
ProductVersion: 11.2.1
BuildVersion: 20D75
pseudo-duplicate of #2362 (specifically the masses section of the first comment)?
Oh I didn't see that one. Yeah that's basically it.
With the addition that guessing is broken in a lot of situations. Mass guessing is broken in the ITPParser (#3199 fixes that) und it seems to be broken in the GROParser as well (see above, Al and Si should easily be guessed).
For Al & Si, the particular problem with the GRO parser is that we're guessing on a guess. Especially with the atomtype guesser being so inaccurate, you end up with a guess on nonsense.
In my view, we should just drop the need for mass to be a required attribute (we shouldn't be guessing anyways), but that will require a deep look at what core components requires masses.
I think @jbarnoud and @lilyminium have some views on all this.
Note; there was a proposal in #2630 to add context to guessers to improve the situation too.
Also going to link #2918 since that aims to avoid type guessing in cases where we can avoid it - [Note to self; try to get that done for 2.0 once we've finished with the gsoc applications & 1.1], which should guarantee accurate mass guessing for those readers.
A little off topic, but really we shouldn't be guessing anything by default. That way the guessing is a conscious decision by the user and they're aware of how/why they've guessed stuff.
For this issue I think maybe np.nan would be a better default value, but I'd have to check what that breaks downstream.
I'm going to weigh in on this as a user that was pretty frustrated about the guessing:
What I found especially awkward about the ITPParser was that it managed to get some of the information, but then guessed other info.
Users may have certain expectations about what the parsers do:
My expectation would be that:
- ITP topology with all attributes defined -> Everything gets parsed as I defined it
- GRO file (with basically nothing defined) -> Just the positions get parsed
Instead my experience was:
- ITP topology -> stuff gets guessed but pretty bad
- GRO file -> Stuff get guessed but pretty bad
My sentiment is this: If one gets a (intact and whole) topology from a MD simulation, attributes like charges and masses should be completely defined, with no need to guess parts of those values. If this is not the case, either the parser is not capable enough (See the discussion in #3199 , all info is in the itp files, no need to guess anything), or the topology is broken (missing include files etc..). So guessing of attributes that will later be used in calculations (i.e. carge and mass) should never be the default option.
Please correct me if this is not true in all cases.
The third case are file formats like GRO files. Here, all basically everything needs to be guessed because its not a topology but rather just a fancy/ugly trajectory. In this case I believe it is ok to require the user to go the extra mile if they really want those attributes that are not present in the configuration they provided.
I hope I don't sound too harsh. I see this is a discussion that is already going on for quite some time.
I was trying to remember where the main discussion for this was, here it is: https://github.com/MDAnalysis/mdanalysis/issues/2348
I hope I don't sound too harsh. I see this is a discussion that is already going on for quite some time.
It's pretty much the same view (i.e. don't guess by default) as what we aim to eventually do, so I wouldn't consider it harsh at all.
TLDR; As far as I remember, currently there are two enforced attributes (that require guessing if not present), atomtype and mass. The former is somewhat confusing and we essentially need to do away with it (or at least make it clearer as to how it's assigned - there's other issues / PRs about this somewhere). The latter is a design decision before my time, but as @richardjgowers mentioned, it's a case of finding out all the bits that require mass and making them fail in an adequate manner if masses are unknown.
I think some of this was slated for v2.0, but it's a pretty big change that will need a least one version's worth of deprecation (plus we really need to get 2.0 out earlier than later). We might aim for this in 2.1, the only limiting factor here is just a case of finding enough time to get it all done.
I see, this entails a lot of things...
Might I propose for the 2.0 release to throw a warning every time a guesser guesses a value, thereby making clear that the user might be doing something they do not want. (basically #2222 )
A crazier idea I had (basically #2348 with a twist): Introduce a parameter guess in Universe that for the time being defaults to something like ['type','bond','mass']. This would allow to developers of tools using MDA to set for example guess=['type','bond'] when importing data that needs proper masses, so that an error is thrown. Later versions could gradually change the default behaviour of guess to match the state of the element handling.
Might I propose for the 2.0 release to throw a warning every time a guesser guesses a value, thereby making clear that the user might be doing something they do not want. (basically #2222 )
I'll let the other @MDAnalysis/coredevs weigh in here, but I'm not super into the idea of the sheer number of warnings we'd have to emit just because we're parsing a file and passing it through an atomtype/mass guesser. There's a point where we drown users in too many warnings. We already warn on a failure to guess types/masses, which as far as I understand covers the current failure cases?
Separate thought here; I'd also like to get 2.0 out of the door ASAP after 1.1 gets done (just the one thing left to deal with and we can release 1.1). So I'm not super into the idea of adding to the 2.0 milestone since there's already a lot to get done before the end of May. If folks vote to add to the milestone, I will ask that they be made responsible for making sure it gets delivered.
I hate to say it but 2.0 would be the ideal point in time to stop guessing. We can then gradually re-introduce means to derive information in better controlled manner.
I don't think that this would require specific deprecations/warnings in 1.x – it will simply be a big breaking change that comes with 2.0. People who rely on the haphazard guessing will have to use 1.1 for a while until, say 2.1, introduces functionality for intentional guessing.
(I am certainly responsible for some of the early decisions regarding guessing, namely trying to do the right thing in typical situations to make the user experience seamless. With the broader use of the library this has become untenable. And it doesn't help that some parsers don't parse all information that they can. I think one of the lessons over the last 10 years is that stuff that seemed like a good idea in a limited context easily breaks when you have many users using tools in ways that you didn't envisage – CG models certainly weren't a big use area early on. The other obvious corollary is that one has to be careful with balancing technical debt vs. satisfying users/keeping up with the competition "that just does it" when adding a feature or fix that seems to solve a limited problem.)
I hate to say it but 2.0 would be the ideal point in time to stop guessing. We can then gradually re-introduce means to derive information in better controlled manner.
Given today's moving up of the 2.0 deadline, is this still your current opinion @orbeckst? If so, do you mind if I put you in charge of mobilizing a solution here?
Passing the buck is fair.
It would be a missed opportunity not to break this with 2.0. I'll have a look.
I am looking at Project: consistent element handling. This looks pretty daunting. A few notes here while I am reviewing current efforts.
- Looking at draft PR #3218, type guessing can be removed easily, but mass guessing will then leave a pretty big hole. Mass guessing would be reliable (for atomistic systems) if we had elements.
- @lilyminium has good summary in #2630
- @zemanj 's https://github.com/MDAnalysis/mdanalysis/issues/2630#issuecomment-599784226 tl;dr: We should not guess anything by default.
I don't think that it's realistic to reliably change the way MDA arrives at types, elements, masses over the next few days (i.e., until the 2.0 deadline). Not only would we need to have our own tests pass, there are also examples in the User Guide etc that need to be checked.
Breaking the guessing for 2.0 would be good but I don't see this as a realistic goal anymore.
If there are band-aids (such as for this particular issue where we have mass = 0 contributing) then we can do those (or do them in patch releases).
@aya9aladdin does your PR #3753 address this issue?
This issue is a Pending depreciation issue for not breaking any old behavior. In PR https://github.com/MDAnalysis/mdanalysis/pull/3753 there is a warning about populating unknown masses with 0 value, which will be replaced in future versions by Mass's is_value_missing attribute which is equal to np.nan
Thanks for explaining, so PR #3753 is not fixing this issue. A fix will have to wait for 3.0 when we can break the current behavior.