mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

ENH: add TPR position and velocity read support

Open tylerjereddy opened this issue 11 months ago • 16 comments

Fixes gh-464.

  • Add support for reading positions and velocities from GMX .tpr files over the full range of tpx versions supported by regular .tpr topology parsing (58 through 137). We've had an issue open to add this capability for almost a decade now.

Notes for reviewers:

  1. Suggestions for removing the use of global to track TPR file precision are appreciated.
  2. Performance/duplication considerations -- as Richard noted in the cognate issue, the way our topology and coordinate handling code is organized, it may make sense to keep the separation here, but there should at least be opportunity to reduce code duplication if not completely avoid re-reading the topology to seek to the coordinate positions); that said, our .tpr reading performance is atrocious anyway and my attempts to fix at gh-4098 have stalled for more than a year, so I suggest we defer performance (and probably even duplication) considerations until after the capability and tests are cemented in.
  3. Please help me identify TPR testing scenarios that are missing, and ideally suggest the TPR file(s) I should use to test them.
  4. I'll probably deal with CHANGELOG and docs changes when it looks like we're close to ready to merge, but I still appreciate suggestions on i.e., locations where doc changes will be needed.

PR Checklist

  • [x] Tests?
  • [ ] Docs?
  • [ ] CHANGELOG updated?
  • [ ] Issue raised/referenced?

Developers certificate of origin


📚 Documentation preview 📚: https://mdanalysis--4873.org.readthedocs.build/en/4873/

tylerjereddy avatar Dec 29 '24 21:12 tylerjereddy

Hello @tylerjereddy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 59:80: E501 line too long (89 > 79 characters) Line 61:80: E501 line too long (93 > 79 characters) Line 75:80: E501 line too long (81 > 79 characters)

Line 15:80: E501 line too long (119 > 79 characters) Line 16:23: E261 at least two spaces before inline comment Line 22:5: E124 closing bracket does not match visual indentation Line 24:23: E261 at least two spaces before inline comment Line 30:5: E124 closing bracket does not match visual indentation Line 32:16: E261 at least two spaces before inline comment Line 33:19: E241 multiple spaces after ',' Line 33:33: E241 multiple spaces after ',' Line 34:20: E241 multiple spaces after ',' Line 34:34: E241 multiple spaces after ',' Line 38:5: E124 closing bracket does not match visual indentation Line 40:21: E261 at least two spaces before inline comment Line 41:19: E241 multiple spaces after ',' Line 41:33: E241 multiple spaces after ',' Line 42:19: E241 multiple spaces after ',' Line 42:33: E241 multiple spaces after ',' Line 44:20: E241 multiple spaces after ',' Line 46:5: E124 closing bracket does not match visual indentation Line 47:14: E261 at least two spaces before inline comment Line 48:19: E241 multiple spaces after ',' Line 48:33: E241 multiple spaces after ',' Line 49:20: E241 multiple spaces after ',' Line 49:34: E241 multiple spaces after ',' Line 53:5: E124 closing bracket does not match visual indentation Line 54:14: E261 at least two spaces before inline comment Line 55:19: E241 multiple spaces after ',' Line 55:33: E241 multiple spaces after ',' Line 56:20: E241 multiple spaces after ',' Line 56:34: E241 multiple spaces after ',' Line 60:5: E124 closing bracket does not match visual indentation

Comment last updated at 2024-12-31 17:39:01 UTC

pep8speaks avatar Dec 29 '24 21:12 pep8speaks

Codecov Report

Attention: Patch coverage is 94.44444% with 5 lines in your changes missing coverage. Please review.

Project coverage is 93.63%. Comparing base (d412c9a) to head (006ccb3).

Files with missing lines Patch % Lines
package/MDAnalysis/coordinates/TPR.py 90.90% 0 Missing and 4 partials :warning:
package/MDAnalysis/topology/tpr/utils.py 97.77% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #4873   +/-   ##
========================================
  Coverage    93.63%   93.63%           
========================================
  Files          177      178    +1     
  Lines        22033    22121   +88     
  Branches      3115     3143   +28     
========================================
+ Hits         20631    20714   +83     
  Misses         948      948           
- Partials       454      459    +5     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Dec 29 '24 21:12 codecov[bot]

@tylerjereddy since this PR is composed of two new files, and the modifications are all in one place, would you mind if I go ahead with #4849? I don't think there would be significant conflicts.

RMeli avatar Dec 29 '24 22:12 RMeli

ok

tylerjereddy avatar Dec 29 '24 22:12 tylerjereddy

Thanks @tylerjereddy. It is now merged.

RMeli avatar Dec 29 '24 22:12 RMeli

I drafted in support/testing for handling velocities as well, plus support/testing for reading positions/velocities from one more tpx version (133). I'm trying to skip the CI when I push in for the next little bit (looks like Azure still ran though..).

Need to poke around the .tpr test files to see if there are some more interesting ones as well re: non-zero velocities, etc.

tylerjereddy avatar Dec 30 '24 23:12 tylerjereddy

I was able to extend support and testing for .tpr position/velocity reads back as far as GMX 2023 (tpx version 129); GMX 2022rc1 (tpx 127) will fail testing because it needs more binary file striding version checks that I'll need to study in the C++ code (not too hard, but time consuming).

I also modernized an old TPR file we had with non-zero velocities (and a large number of atoms) for cobrotoxin via gmx convert-tpr and then added a test case for this. It passes for positions, but not for velocities yet, so that was a good intuition to check non-zero velocity/larger system case. Since that was converted to tpx 134 it will be a little harder to sort out the problem there, likely requiring side-by-side printf-vs-print for GMX vs. us on the binary striding.

tylerjereddy avatar Dec 31 '24 17:12 tylerjereddy

I also modernized an old TPR file we had with non-zero velocities (and a large number of atoms) for cobrotoxin via gmx convert-tpr and then added a test case for this

This test case for non-zero velocities should be passing now, as should other tests, apart from the linter, after some more bug fixes.

The outstanding items here would likely be:

  1. going back to support a wider range of tpx versions
  2. probably a few more tpr files with non-zero velocities across generations should be tested

tylerjereddy avatar Apr 19 '25 04:04 tylerjereddy

Updates from work today:

  • Added support for reading positions and velocities from tpx version 137 (GROMACS 2025.0), along with a regression test confirming position and velocity reads from a .tpr file from that version of GMX (expected values from gmx dump as usual)
  • finally fixed up the black linter complaints

tylerjereddy avatar Apr 19 '25 19:04 tylerjereddy

Added support for tpx version 127 today (GROMACS 2022)--two regression tests with different numbers of atoms were used in this case because I had to do manual binary offset analysis vs. newer versions of the format to get the positions properly extracted.

Next up from our support table should probably be tpx 122 (GROMACS 2021), maybe tomorrow.

tylerjereddy avatar Apr 20 '25 19:04 tylerjereddy

For today, I added support for tpx version 122 (GROMACS 2021) + two regression tests. Luckily, this was easier than the last two versions I added, no byte skipping shims needed.

For tomorrow, I'll see if I can add support for tpx version 119 (GROMACS 2020).

tylerjereddy avatar Apr 21 '25 17:04 tylerjereddy

Today, added support for tpx version 119 (GROMACS 2020), which required minor byte striding adjustments to pass the usual two regression tests.

For tomorrow, I'll aim to extend backward to tpx 116 (GROMACS 2019).

tylerjereddy avatar Apr 22 '25 18:04 tylerjereddy

Today, support was added for tpx 116 (GMX 2019) and tpx 112 (GMX 2018). The latter required some non-trivial byte-stride analysis on the binary files.

For tomorrow, I'll take aim at tpx 110 (GMX 2016) if I can.

tylerjereddy avatar Apr 23 '25 15:04 tylerjereddy

Today, added support for tpx 110 (GROMACS 2016) and tpx 103 (GROMACS 5.1).

I'll try to check tpx 100 (GMX 5.x before 5.1) tomorrow.

tylerjereddy avatar Apr 24 '25 17:04 tylerjereddy

Today, added support for tpx version 100 (GMX 5.x before 5.1). No byte striding shims were needed.

Next up is tpx version 83 (GMX 4.6.x)--I did a quick check and that will require more analysis/byte striding.

tylerjereddy avatar Apr 25 '25 14:04 tylerjereddy

Ok, this weekend I added position and velocity reading support for all remaining tpx versions that MDAnalysis can read topology from under normal circumstances.

I think this is ready for review now--I've updated the original comment above with some notes for reviewers. The linter will currently fail because I'm using a global construct to track TPR precision.

Suggestions for fixing that and anything else you find are appreciated.

tylerjereddy avatar Apr 27 '25 22:04 tylerjereddy

@orbeckst I took at stab at addressing your and my review comments, after having to rebase to deal with gh-5061. Pasting from the commit message below the fold:

Summary of latest revisions to PR 4873
    DOC, TST, BUG: PR 4873 revisions
    
    * Add TPR position and velocity reading support to the
    coordinate reading support table in our docs, as requested
    by reviewer.
    
    * Add a regression test for the proper raising of an error
    when we use a proper TPR topology file, but an improper format
    TPR coordinate file. This also revealed a bug in `TPR.py`
    where the `logger` was not initialized, which is now fixed
    and guarded by the same regression test. Note that this
    new regression test does correctly distinguish the error
    message for an improper coordinate TPR, which is distinct
    from an improper topology TPR (I adjusted the error message
    to mention coordinate parsing specifically).
    
    * It may be worth noting that before dealing with the above
    point, I hadn't thought much about the `u = mda.Universe(tpr_file, tpr_file)`
    incantation vs. extracting topology and coordinates from the single
    argument. That said, the separated-argument incantation does appear to
    work just fine as well, so we could consider i.e., parametrizing our TPR
    coordinate/velocity testing infrastructure to probe both incantations
    if desirable.
    
    * The use of a `global` variable for storing tpx precision data
    has been removed in favor of adjuting `do_mtop()` to accept
    this information as a new argument. This addresses one of my
    own review comments (and a linter complaint about globals).

One remaining action item that your comments got me thinking about is the incantations like u = mda.Universe(tpr_file, tpr_file). This does seem to work just fine, as the detailed review comment below the fold indicates, but there is perhaps a question of wanting to test that incantation as well. I could parametrize the current test to probe both incantations I suppose.

Anything else that we are worried about? Possibly assessing performance impact, though I believe we may be in agreement that going for the new functionality first and then dealing with performance issues later may be "ok."

tylerjereddy avatar Jul 18 '25 19:07 tylerjereddy

Super-quick comment: Looks as if you addressed all concerns. I'd merge correct code even if there's still optimization to be done. However, is there any performance regression for the cases that TPRs had been used for in the past, i.e., do your changes incur an overhead even if x/v are not needed as in Universe(TPR, XTC)?

orbeckst avatar Jul 18 '25 19:07 orbeckst

Checking that u = mda.Universe(tpr_file, tpr_file) and u = mda.Universe(tpr_file) both work would be good.

orbeckst avatar Jul 18 '25 19:07 orbeckst

Another thing I didn't test, but that actually works, is different versions of TPR for topology vs. x/v reading: u = mda.Universe(TPR2020, TPR2024_4). That's perhaps slightly mind bending, but I suppose there is no reason it shouldn't work. I'm not sure how far I'd want to go in officially supporting/testing that at this time, but just noticed it while addressing comments locally. Maybe I'll just add a test for a single case like that for now.

tylerjereddy avatar Jul 18 '25 20:07 tylerjereddy

Yes, add a one off test, maybe it breaks in interesting ways in the future and reveals a real issue. Might eventually be useful when we optimize towards reading only once instead of separately for topology and x/v.

orbeckst avatar Jul 18 '25 20:07 orbeckst

I pushed in a bunch more revisions with a detailed commit message. Probably some kind of linter/docstring issue will crop up now.

Let me see what I can do about checking the performance implications for prior usage a bit.

tylerjereddy avatar Jul 18 '25 21:07 tylerjereddy

I'm having trouble running the asv benchmarks locally on both Mac and Linux. The sample benchmarks I drafted are below the fold (looks like this file may have annoyances with carriage returns as well?). My sample incantation was: asv continuous -E virtualenv -v -e -b "TPRBench" develop treddy_issue_464

--- a/benchmarks/benchmarks/topology.py
+++ b/benchmarks/benchmarks/topology.py
@@ -4,7 +4,8 @@ from MDAnalysis.guesser import DefaultGuesser
 
 try:
     from MDAnalysis.exceptions import NoDataError
-    from MDAnalysisTests.datafiles import GRO
+    from MDAnalysisTests.datafiles import (GRO, TPR_xvf_2024_4,^M
+                                           TPR2023, TPR406)^M
 except:
     pass
 
@@ -45,3 +46,26 @@ class BondsBench(object):
     def time_bonds(self, num_bonds):
         """Benchmark for calculating bonds"""
         self.u.bonds
+^M
+^M
+class TPRBench:^M
+    """^M
+    Benchmarks for reading in TPR files, for which we^M
+    support both topology and coordinate/velocity parsing.^M
+    """^M
+    # benchmark over a variety of TPR versions and^M
+    # system sizes^M
+    params = (TPR_xvf_2024_4, # 19385 atoms^M
+              TPR2023, # 2263 atoms^M
+              TPR406, # 2263 atoms^M
+             )^M
+    param_names = ["tpr_file"]^M
+^M
+^M
+    def time_single_incantation_tpr(self, tpr_file):^M
+        MDAnalysis.Universe(tpr_file)^M
+^M
+^M
+    def time_double_incantation_tpr(self, tpr_file):^M
+        # only possible from MDA 2.10.0 onward^M
+        MDAnalysis.Universe(tpr_file, tpr_file)^M

I see Failed to build the project and import the benchmark suite. shortly after Successfully built MDAnalysis in a verbose log... don't have time to dig further on that at the moment. If other folks regularly run the benchmarks in a venv setup I'm open to suggestions... it looks like asv.conf.json may also need some modernization (some projects need to customize specific build/install cmds in there too).

tylerjereddy avatar Jul 18 '25 22:07 tylerjereddy

Just after reviewing I looked at the docs and realized that the new TPR coordinate docs were not linked. I quickly added the files... hopefully all looks good.

orbeckst avatar Aug 03 '25 23:08 orbeckst

Congratulations @tylerjereddy for closing a decade-old issue!

(Btw, I carefully preserved and edited your log messages in the squash-commit as they contain a plethora of useful information.)

orbeckst avatar Aug 04 '25 02:08 orbeckst