mdanalysis
mdanalysis copied to clipboard
basic guesser features
Fixes #
Changes made in this Pull Request:
- I have created a guesser package that will contain all upcoming the context aware guessers. it contain the following so far:
a core module that only contain one method that will be used by the guess_topologyAttribute method of the universe a base module that contains the
BaseGusserclass that will be inherited by every guesser a DefaultGuesser module which will handle the currently existing guess methods (for now, I just added the mass guessing to test with one attribute guessing method)
- I made changes to the universe class to work with the new methodology
- I removed the mass guessing from the PDBParser to test guessing it at the universe level
PR Checklist
- [x] Tests?
- [x] Docs?
- [x] CHANGELOG updated?
- [x] Issue raised/referenced?
Hello @aya9aladdin! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file
package/MDAnalysis/converters/OpenMMParser.py:
Line 184:80: E501 line too long (101 > 79 characters) Line 185:80: E501 line too long (118 > 79 characters) Line 197:80: E501 line too long (107 > 79 characters) Line 200:80: E501 line too long (90 > 79 characters) Line 202:1: W293 blank line contains whitespace
- In the file
package/MDAnalysis/core/groups.py:
Line 3298:1: W293 blank line contains whitespace
- In the file
package/MDAnalysis/core/universe.py:
Line 387:80: E501 line too long (90 > 79 characters) Line 589:80: E501 line too long (85 > 79 characters) Line 1464:80: E501 line too long (92 > 79 characters) Line 1474:80: E501 line too long (81 > 79 characters) Line 1474:82: W291 trailing whitespace Line 1476:1: W293 blank line contains whitespace Line 1480:80: E501 line too long (87 > 79 characters) Line 1491:1: W293 blank line contains whitespace Line 1492:79: W291 trailing whitespace Line 1498:1: W293 blank line contains whitespace Line 1507:80: E501 line too long (126 > 79 characters) Line 1510:52: W291 trailing whitespace Line 1522:80: E501 line too long (83 > 79 characters) Line 1525:13: E303 too many blank lines (2) Line 1528:35: E127 continuation line over-indented for visual indent Line 1528:52: W291 trailing whitespace Line 1532:30: E127 continuation line over-indented for visual indent
- In the file
package/MDAnalysis/guesser/base.py:
Line 48:80: E501 line too long (98 > 79 characters) Line 59:1: W293 blank line contains whitespace Line 61:1: W293 blank line contains whitespace Line 69:80: E501 line too long (89 > 79 characters) Line 70:80: E501 line too long (104 > 79 characters) Line 71:80: E501 line too long (92 > 79 characters) Line 71:93: W291 trailing whitespace Line 73:1: W293 blank line contains whitespace Line 85:1: W293 blank line contains whitespace Line 117:1: W293 blank line contains whitespace Line 133:32: W291 trailing whitespace Line 141:1: W293 blank line contains whitespace
- In the file
package/MDAnalysis/guesser/default_guesser.py:
Line 32:1: E302 expected 2 blank lines, found 1 Line 34:80: E501 line too long (115 > 79 characters) Line 34:116: W291 trailing whitespace Line 35:80: E501 line too long (104 > 79 characters) Line 37:14: W291 trailing whitespace Line 45:1: W293 blank line contains whitespace Line 46:80: E501 line too long (86 > 79 characters) Line 52:1: W293 blank line contains whitespace Line 57:1: W293 blank line contains whitespace Line 65:34: E203 whitespace before ':' Line 88:1: W293 blank line contains whitespace Line 95:80: E501 line too long (84 > 79 characters) Line 120:80: E501 line too long (112 > 79 characters) Line 156:14: E225 missing whitespace around operator Line 159:14: W291 trailing whitespace Line 232:80: E501 line too long (80 > 79 characters) Line 234:80: E501 line too long (80 > 79 characters) Line 236:80: E501 line too long (80 > 79 characters) Line 240:80: E501 line too long (82 > 79 characters) Line 241:80: E501 line too long (80 > 79 characters) Line 242:80: E501 line too long (83 > 79 characters) Line 245:80: E501 line too long (80 > 79 characters) Line 246:80: E501 line too long (82 > 79 characters) Line 261:80: E501 line too long (80 > 79 characters) Line 282:80: E501 line too long (81 > 79 characters) Line 292:42: E713 test for membership should be 'not in' Line 319:5: E303 too many blank lines (2) Line 322:80: E501 line too long (93 > 79 characters) Line 324:80: E501 line too long (85 > 79 characters) Line 325:27: E128 continuation line under-indented for visual indent Line 325:80: E501 line too long (81 > 79 characters) Line 326:1: W293 blank line contains whitespace Line 355:80: E501 line too long (80 > 79 characters) Line 356:80: E501 line too long (83 > 79 characters) Line 365:80: E501 line too long (93 > 79 characters) Line 405:80: E501 line too long (94 > 79 characters) Line 406:1: W293 blank line contains whitespace Line 410:80: E501 line too long (81 > 79 characters) Line 437:20: E713 test for membership should be 'not in' Line 446:5: E303 too many blank lines (2) Line 455:5: E303 too many blank lines (2) Line 488:1: W293 blank line contains whitespace
- In the file
package/MDAnalysis/topology/PDBParser.py:
Line 334:80: E501 line too long (103 > 79 characters) Line 341:80: E501 line too long (108 > 79 characters)
- In the file
package/MDAnalysis/topology/TOPParser.py:
Line 293:80: E501 line too long (100 > 79 characters)
- In the file
package/MDAnalysis/topology/TXYZParser.py:
Line 122:16: E111 indentation is not a multiple of four
- In the file
testsuite/MDAnalysisTests/analysis/test_encore.py:
Line 618:80: E501 line too long (90 > 79 characters) Line 622:80: E501 line too long (105 > 79 characters) Line 624:1: W293 blank line contains whitespace Line 629:1: W293 blank line contains whitespace Line 903:9: E741 ambiguous variable name 'l' Line 904:80: E501 line too long (97 > 79 characters) Line 905:80: E501 line too long (91 > 79 characters)
Line 177:23: E127 continuation line over-indented for visual indent Line 179:80: E501 line too long (89 > 79 characters) Line 180:80: E501 line too long (106 > 79 characters) Line 205:80: E501 line too long (95 > 79 characters)
- In the file
testsuite/MDAnalysisTests/converters/test_rdkit.py:
Line 148:80: E501 line too long (83 > 79 characters)
- In the file
testsuite/MDAnalysisTests/core/test_universe.py:
Line 383:1: E302 expected 2 blank lines, found 1 Line 393:1: W293 blank line contains whitespace Line 400:13: E117 over-indented Line 403:1: W293 blank line contains whitespace Line 410:1: W293 blank line contains whitespace Line 518:5: E301 expected 1 blank line, found 0 Line 755:80: E501 line too long (86 > 79 characters) Line 1370:80: E501 line too long (86 > 79 characters) Line 1374:9: E303 too many blank lines (2)
- In the file
testsuite/MDAnalysisTests/guesser/test_base.py:
Line 29:1: E302 expected 2 blank lines, found 1 Line 30:1: W293 blank line contains whitespace Line 35:29: W291 trailing whitespace Line 38:1: W293 blank line contains whitespace Line 39:1: W391 blank line at end of file
Line 53:1: E302 expected 2 blank lines, found 1 Line 60:80: E501 line too long (93 > 79 characters) Line 63:5: E303 too many blank lines (2) Line 66:25: E201 whitespace after '(' Line 66:80: E501 line too long (91 > 79 characters) Line 73:80: E501 line too long (121 > 79 characters) Line 79:80: E501 line too long (80 > 79 characters) Line 90:80: E501 line too long (84 > 79 characters) Line 150:51: E231 missing whitespace after ',' Line 150:53: E231 missing whitespace after ',' Line 174:1: E302 expected 2 blank lines, found 1 Line 176:80: E501 line too long (92 > 79 characters) Line 178:27: E127 continuation line over-indented for visual indent Line 182:1: E302 expected 2 blank lines, found 1 Line 188:19: E127 continuation line over-indented for visual indent Line 190:1: E302 expected 2 blank lines, found 1 Line 196:19: E127 continuation line over-indented for visual indent Line 232:27: E127 continuation line over-indented for visual indent
- In the file
testsuite/MDAnalysisTests/topology/base.py:
- In the file
testsuite/MDAnalysisTests/topology/test_itp.py:
Line 249:1: W293 blank line contains whitespace
Line 259:1: W293 blank line contains whitespace
Comment last updated at 2022-09-12 19:40:16 UTC
@aya9aladdin I recommend that you push your changes often. Do not wait for the code to be finished. By pushing often, we get to see your progress, we get to comment on your latest version instead of a version that may already be obsolete, and we may catch issues earlier, reducing the amount of work you will have to redo.
@aya9aladdin I recommend that you push your changes often. Do not wait for the code to be finished. By pushing often, we get to see your progress, we get to comment on your latest version instead of a version that may already be obsolete, and we may catch issues earlier, reducing the amount of work you will have to redo.
I was planning to push everything when I finish indeed. I don't know if I have to make a pull request at this unmatured level or not?
You already have a pull request, no need to open another one. Instead, update this pull request by pushing your new commits.
I have made some updates:
1- I removed all mass and type guessing from parsers and transferred it to take place inside the universe initiation, so mass and atom types will still be guessed automatically until we are ready to remove it
2- some attributes depend on other attributes to be guessed. if those parent attributes also need to be guessed, this can raise unnecessary errors if we attempted to guess the child attribute before the parent one. So, I added a rank dictionary to the guesser class which will rank each attribute based on its dependency on parent attributes to be guessed. For example ,AtomType depend on atom name then it will have rank 1. While masses depends on atom type which depends on atom name then it will have rank 2. By using this rank atom types will be guessed then masses to avoid errors.
(I hard coded it for now in the DefaultGuesser for testing)
next update will have:
1-finishing the BaseGuesser
2- add guesser metaclass for registration of guessers and other required dictionaries
3- add AtomType attributes to be equal to Element for PDBParser, FHIAIMSParser, TXYZParser, and XYZParser parsers
I'm just more concerned with the logic than the documentation at the moment that's why it's not very accurate
I updated some files as follows:
1- I modified guess_bonds inside the universe to pass a context to it (honestly I didn't get why bond guessing happens inside the AtomGroup, not the universe, so I left it as it is just modified the code to pass the context)
2- I modified the RDKitParser, PDBParser, XYZParser, and FHIAMSParser to add AtomType if it exists, and if not assign it to the elements then names attributes
3- at the point most of the DefualtGuesser is considered complete
4- I added a GuesserMeta to register the guesser classes to be implemented. I was planning to make the _guess dictionary mimic the TopologyAtrribute.transplant but I didn't find it has an application right now, so I postponed it to when I begin working on new guesser classes and see how it could work
What I would do next: 1- update modules where old guessing takes place (hydrogen bond analysis modules as I remember) 2- test and document what I have done so far
One more thing. Please give more explicit titles to your commits. When I receive an email notification about a commit entitled Update DefaultGuesser.py it is much less useful than if the title tells what you updated in the file.
What you should really focus on now, is to check that your code does not change the current behaviour. If you create a universe with the code in your pull request, do you get the same universe as if you create one with the main branch?
yes I check this after every update I make. But still need to make more thorough tests to make sure that everything is working fine
As you wrote in your summary email, it is time to focus on the tests. Before even writing new tests, most of the existing tests should pass. At this point, only the tests about the parsers should fail as they check if the parsers made the guessing.
Looking at the I report, much more tests than that are failing. There is a common cause for many tests: Universe.empty. We use Universe.empty to create simple test universes. This avoids having to write custom input files for every test, and it is faster than parsing those files (which is crucial as we want the tests to run as fast as possible).
Universe.empty should only contain what the user manually inputted. There, nothing should be guessed. Yet, maybe ironically, we have a universe that guesses much more than before! By making sure Universe.empty doesn't do any guessing, you should fix many tests. Have a look at: https://github.com/MDAnalysis/mdanalysis/blob/86efe83d77209ef7c6c555a2370cd07589b64a50/package/MDAnalysis/core/universe.py#L465
Looking at the CI error report, another issue seems that you try guessing the masses before guessing the types:
2022-08-01T04:37:58.3102637Z ___________ ERROR at setup of TestFHIAIMSWriter.test_write_selection ___________
2022-08-01T04:37:58.3102880Z [gw0] linux -- Python 3.8.13 /usr/share/miniconda/envs/test/bin/python
2022-08-01T04:37:58.3102886Z
2022-08-01T04:37:58.3102965Z @staticmethod
2022-08-01T04:37:58.3103050Z @pytest.fixture()
2022-08-01T04:37:58.3103125Z def u_no_resnames():
2022-08-01T04:37:58.3104523Z > return make_Universe(['names', 'resids'], trajectory=True)
2022-08-01T04:37:58.3104529Z
2022-08-01T04:37:58.3104732Z /home/runner/work/mdanalysis/mdanalysis/testsuite/MDAnalysisTests/coordinates/base.py:551:
2022-08-01T04:37:58.3104843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2022-08-01T04:37:58.3105038Z /home/runner/work/mdanalysis/mdanalysis/testsuite/MDAnalysisTests/dummy.py:79: in make_Universe
2022-08-01T04:37:58.3105134Z u = mda.Universe.empty(
2022-08-01T04:37:58.3105322Z /home/runner/work/mdanalysis/mdanalysis/package/MDAnalysis/core/universe.py:474: in empty
2022-08-01T04:37:58.3105399Z u = cls(top)
2022-08-01T04:37:58.3105578Z /home/runner/work/mdanalysis/mdanalysis/package/MDAnalysis/core/universe.py:381: in __init__
2022-08-01T04:37:58.3105717Z self.guess_TopologyAttributes(context, to_guess)
2022-08-01T04:37:58.3105934Z /home/runner/work/mdanalysis/mdanalysis/package/MDAnalysis/core/universe.py:1468: in guess_TopologyAttributes
2022-08-01T04:37:58.3106039Z values = guesser.guess_Attr(attr)
2022-08-01T04:37:58.3106224Z /home/runner/work/mdanalysis/mdanalysis/package/MDAnalysis/guesser/base.py:72: in guess_Attr
2022-08-01T04:37:58.3106382Z return self._guess[guess]()
2022-08-01T04:37:58.3106592Z /home/runner/work/mdanalysis/mdanalysis/package/MDAnalysis/guesser/default_guesser.py:62: in guess_masses
2022-08-01T04:37:58.3106716Z atom_types = self._universe.atoms.types
2022-08-01T04:37:58.3106892Z /home/runner/work/mdanalysis/mdanalysis/package/MDAnalysis/core/groups.py:2537: in __getattr__
2022-08-01T04:37:58.3107020Z return super(AtomGroup, self).__getattr__(attr)
2022-08-01T04:37:58.3107133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2022-08-01T04:37:58.3107139Z
2022-08-01T04:37:58.3107339Z self = <AtomGroup with 125 atoms>, attr = 'types'
2022-08-01T04:37:58.3107345Z
2022-08-01T04:37:58.3107440Z def __getattr__(self, attr):
2022-08-01T04:37:58.3107538Z selfcls = type(self).__name__
2022-08-01T04:37:58.3107638Z if attr in _TOPOLOGY_ATTRS:
2022-08-01T04:37:58.3107735Z cls = _TOPOLOGY_ATTRS[attr]
2022-08-01T04:37:58.3107860Z if attr == cls.singular and attr != cls.attrname:
2022-08-01T04:37:58.3108059Z err = ('{selfcls} has no attribute {attr}. '
2022-08-01T04:37:58.3108227Z 'Do you mean {plural}?')
2022-08-01T04:37:58.3108380Z raise AttributeError(err.format(selfcls=selfcls, attr=attr,
2022-08-01T04:37:58.3108499Z plural=cls.attrname))
2022-08-01T04:37:58.3108568Z else:
2022-08-01T04:37:58.3108807Z err = 'This Universe does not contain {singular} information'
2022-08-01T04:37:58.3108942Z > raise NoDataError(err.format(singular=cls.singular))
2022-08-01T04:37:58.3109147Z E MDAnalysis.exceptions.NoDataError: This Universe does not contain type information
2022-08-01T04:37:58.3109153Z
2022-08-01T04:37:58.3109340Z /home/runner/work/mdanalysis/mdanalysis/package/MDAnalysis/core/groups.py:609: NoDataError
(Ignore the left-most column; I am looking at the raw logs for CI, which contain timestamps.)
Hello everyone
today I finished any updates to the code in this pull request. I ensured that the only test that fails is the one expected to do so. The next thing is to remove these old tests and the guesser module inside the topology package and build new tests.
I made updates to the code to fix some errors as follows: 1- added checkpoints for when current automatic guessing must take place 2- handled the particular case of TXYZParser of guessing masses from names rather than atom types 3- check if masses and atom types need to be guessed (cause some parsers already read them which causes some conflict) 4- modified guess_bonds to be more robust and consistent with old behavior
About https://github.com/MDAnalysis/mdanalysis/pull/3753#issuecomment-1200827768
I have seen many changes recently, but none seemed to touch the line I mentioned above. Just to clarify, I expect that the issue with Universe.empty can be solved by changing this line alone.
Universe.empty is a "class method". Instead of taking an instance of the class, self, as first argument, it takes the class itself as cls. The line I am talking about creates the universe, it is equivalent to mda.Universe(top) therefore you can pass all the arguments a universe can take. Including the ones about guessing. I expect you can ask there for nothing to be guessed.
By the way, thank you for the pep8 fixes. I know it must have been annoying to do but it makes a big difference in the long run.
That's why I'm trying to check that guessing happens when a topology came from a parser that already used to call guessers, not any other source that doesn't need any automatic guessing to happen.
This seems to be your main issue at the moment. The one that has you add most of the complexity in the code.
Could you summarise what parsers try to guess what? This way we can try to find a generic solution together.
That's why I'm trying to check that guessing happens when a topology came from a parser that already used to call guessers, not any other source that doesn't need any automatic guessing to happen.
This seems to be your main issue at the moment. The one that has you add most of the complexity in the code.
Could you summarise what parsers try to guess what? This way we can try to find a generic solution together.
I made a summary of all parser's expected outcomes regarding atom types and masse
From your table, the rule looks rather simple : if it is read do not guess, even if it is in the list of things to guess. In any case what is directly read from the file should always take precedence.
That logic can be in the guess_TopologyAttribute method: if the attribute is already there do nothing.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 93.23%. Comparing base (
73acc9b) to head (4976166). Report is 4 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #3753 +/- ##
===========================================
- Coverage 93.66% 93.23% -0.44%
===========================================
Files 168 12 -156
Lines 21248 1079 -20169
Branches 3917 0 -3917
===========================================
- Hits 19902 1006 -18896
+ Misses 888 73 -815
+ Partials 458 0 -458
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
What I have finished till the moment:
1- the BaseGusser and guess_topologyAtrributes basics
2 testing new features
What I'm working on: documentations and updating user guide
main features that I consider unmatured enough and more guessers are needed to see how I could improve them:
1- bond guessing; the bond attribute is a special case in handling and has its special methods inside the universe and the AtomGroup classes, so I think the current implementation of context-aware bond guessing will be updated, but I need to see how this could happen with different bond guessing methods from another guesser classes
2- BaseGuesser metaclass and registration of different attribute
We are removing documentation about things which are currently guessed by default
By default, masses and types are guessed if they cannot be read. This was the overall behaviour, but this PR centralises it in one place. The problem I see with having that fact documented in the individual parsers is that the behaviour can now be overwritten, and this is independent of the parser.
We are removing documentation about things which are currently guessed by default
By default, masses and types are guessed if they cannot be read. This was the overall behaviour, but this PR centralises it in one place. The problem I see with having that fact documented in the individual parsers is that the behaviour can now be overwritten, and this is independent of the parser.
If all the parsers behave in the same way and that all things are guessed equally, then that documentation can go somewhere centralised. If it's not, then you'll need a means of documenting which parsers have guessing and which don't.
In any case, the documentation will need to be somewhere?
We are removing documentation about things which are currently guessed by default
By default, masses and types are guessed if they cannot be read. This was the overall behaviour, but this PR centralises it in one place. The problem I see with having that fact documented in the individual parsers is that the behaviour can now be overwritten, and this is independent of the parser.
If all the parsers behave in the same way and that all things are guessed equally, then that documentation can go somewhere centralised. If it's not, then you'll need a means of documenting which parsers have guessing and which don't.
In any case, the documentation will need to be somewhere?
no guessing happens inside any parser now, so I'll add documentation inside universe regarding automatic guessing
We are removing documentation about things which are currently guessed by default
By default, masses and types are guessed if they cannot be read. This was the overall behaviour, but this PR centralises it in one place. The problem I see with having that fact documented in the individual parsers is that the behaviour can now be overwritten, and this is independent of the parser.
If all the parsers behave in the same way and that all things are guessed equally, then that documentation can go somewhere centralised. If it's not, then you'll need a means of documenting which parsers have guessing and which don't. In any case, the documentation will need to be somewhere?
no guessing happens inside any parser now, so I'll add documentation inside universe regarding automatic guessing
It's not whether or not the guessing happens inside the parsers, but rather that if there is parser-specific guessing, then it needs to be documented somewhere where we assume that users will be reading about that specific parsers.
In any case, if things move out of the parser docstring, I'd also ask that we add versionchanged entries in the relevant parsers to let users know where to look about the default guessing behaviour.
We are almost there functionally. Do you keep track of what still needs to be done?
🤔 I think github is to blame here, stuff seems to be magically resolving and unresolving every time I click on "files changed"
🤔 I think github is to blame here, stuff seems to be magically resolving and unresolving every time I click on "files changed"
there is an issue in my recent pull requests also, some changes that I made seem to be overwritten by an older version of the code. I'll try to see where this mess happened and resolve it
With #3779 merged, you should merge the develop branch into this one. Be careful that there is at least one merge conflict to solve.
I finally added the last part of functionality to the code (mainly partial guessing and parser tests). The last thing to do is update the docstring and review the PR comments to see if I missed something + modifications according to new suggestions (hope to have as many as it could be XD)
Do not forget to spend some time on the documentation. There, you have 3 distinct audiences:
- regular users of MDAnalysis who will use the guesser
- advanced users who will develop new guessers
- developers who will work on a topology attribute
I added the docstring regarding the guessing part now.