Feature: non-linear time-averaged MSD
Fixes #5028
Changes made in this Pull Request:
- The msd.py script was modified to support time-averaged mean square displacement calculation for dump files with non-linear time spacing.
PR Checklist
- [ ] Issue raised/referenced? Yes, Issue #5028
- [ ] Tests updated/added? Not yet
- [ ] Documentation updated/added? Updated the short documentation at the top of msd.py as comments
- [ ] package/CHANGELOG file updated? Not yet
- [ ] Is your name in package/AUTHORS? (If it is not, add it!) Yes, added
Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5066.org.readthedocs.build/en/5066/
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 93.87%. Comparing base (91146c5) to head (29df80e).
Additional details and impacted files
```
@@           Coverage Diff            @@
##           develop    #5066   +/-   ##
==========================================
  Coverage    93.86%   93.87%
==========================================
  Files          179      179
  Lines        22217    22252      +35
  Branches      3157     3161       +4
==========================================
+ Hits         20853    20888      +35
  Misses         902      902
  Partials      462      462
```
:umbrella: View full report in Codecov by Sentry.
Thank you @IAlibay @orbeckst and @mrshirts for your time reviewing the initial PR. Based on your suggestions I have modified a few things:
- Added a keyword argument `non_linear` instead of a check like the initial commit's `is_linear`. `non_linear` is set to `False` by default and only set to `True` by the user to call the new method `_conclude_non_linear`. This is done to check whether the GH Actions CI tests pass when the code executes the default route. The new method only executes when `fft=False` and `non_linear=True`.
- Removed the comments for documentation at the top. I will update that in the docstrings later once we are satisfied with the code.
- Removed the print statements.
- `import collections` at the top of the file.
- Deleted `positions = np.zeros((n_frames, n_atoms, 3))` as it was not necessary.
Necessity of a new method:
- The `_conclude_simple` method calculated the time-averaged MSD relying on the number of frames a particular dump file has, not on the timestep at which each frame was dumped. So a better method is needed to calculate "time-averaging", not "frame-averaging".
- Moreover, if the new method `_conclude_non_linear` is used on a linearly dumped file, it generates the same MSD values as the default `_conclude_simple` method. This is because the new method solely uses the timestamps of each frame, which for a linearly dumped file are simply proportional to the frame index!
Algorithm of `_conclude_non_linear` to calculate time-averaged MSD:
- It creates a dictionary/hash-map whose keys are the time differences between frames, `Delta_t`, and whose values are lists of the frame-to-frame MSD values for that particular Δt: `{Δt: [msd1, msd2, ...]}`.
- The reference frame is changed with each iteration, and each time a new key (i.e., Δt) is found it is added to the dictionary/hash-map.
- If a duplicate Δt is found, the MSD is appended to the value (the MSD list) of that specific key in the dictionary/hash-map.
- Lastly, for each Δt in the dictionary/hash-map, the MSD values in the list are averaged.
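The steps above could be sketched roughly as follows (a minimal illustration of the idea, not the actual `_conclude_non_linear` code; the function name and signature are made up):

```python
import collections

import numpy as np

def time_averaged_msd(positions, times):
    """Sketch: positions has shape (n_frames, n_atoms, 3); times has shape
    (n_frames,) and may be non-linearly spaced."""
    msd_by_dt = collections.defaultdict(list)
    n_frames = len(times)
    # Every ordered frame pair (i, j) contributes one MSD sample at
    # Delta_t = times[j] - times[i]; i plays the role of the reference frame.
    for i in range(n_frames):
        for j in range(i + 1, n_frames):
            dt = times[j] - times[i]
            sq_disp = np.sum((positions[j] - positions[i]) ** 2, axis=-1)
            msd_by_dt[dt].append(sq_disp.mean())
    # Finally, average the samples collected for each unique Delta_t.
    delta_ts = np.array(sorted(msd_by_dt))
    msds = np.array([np.mean(msd_by_dt[dt]) for dt in delta_ts])
    return delta_ts, msds
```

For a single ballistic particle this yields MSD(Δt) = Δt² regardless of how the frames are spaced.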
Example performance comparison:
A coarse-grained LAMMPS trajectory with N=60,000 atoms and 21 frames (non-linearly spaced timesteps) was run with FFT turned off:
- around 2.9 s with the new method (toggled on using `non_linear=True`)
- around 1.6 s with the default method
(This does not include the plotting time in the user interface.)
The big picture here is that for polymer MSDs, the results are highly nonlinear in time (unlike with liquids, where the MSD becomes linear very quickly). In fact, they generally have a power-law dependence. Calculating MSD at long lag times ends up requiring a huge amount of data; you get the short-time behavior very accurately and the long-time behavior less accurately, but you don't really need the short-time behavior more accurately than the long-time behavior.
So LAMMPS has a standard option to output trajectory frames logarithmically within every N steps: perhaps with spacing (roughly) 1, 10, 100, 1000. So for every N steps, you are only outputting 4 points. Then maybe your simulation is 100×N steps long, so you will get 100 data points for lags of 1, 10, 100, 1000, etc., and you get equal accuracy for all lags, at a cost of some precision at short times, but using far less storage space (polymer simulations can have a lot of particles!). Polymer scientists use this a lot.
The issue is that the standard code can't handle this, and I'm actually not sure if it can be handled with FFTs. So I'm not totally sure it makes sense to have a single code path (although that would be desirable). What one should have is that both paths give the same answer for LINEAR data (within machine precision), which can be automatically checked, so at least any code duplication is validated.
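The consistency check for linear data could be sketched with two naive reference implementations (illustrative only, not the actual MDAnalysis code paths):

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(0)
# Random-walk toy trajectory: 10 frames, 4 atoms, linear time spacing.
positions = np.cumsum(rng.normal(size=(10, 4, 3)), axis=0)
times = np.arange(10.0)
n = len(times)

# Path 1: frame-averaged MSD over integer frame lags (the "linear" approach).
msd_frame = np.array([
    np.mean([np.mean(np.sum((positions[i + lag] - positions[i]) ** 2, axis=-1))
             for i in range(n - lag)])
    for lag in range(1, n)
])

# Path 2: time-averaged MSD binned by the physical Delta_t of each frame pair.
bins = defaultdict(list)
for i in range(n):
    for j in range(i + 1, n):
        bins[times[j] - times[i]].append(
            np.mean(np.sum((positions[j] - positions[i]) ** 2, axis=-1)))
msd_time = np.array([np.mean(bins[dt]) for dt in sorted(bins)])

# For linearly spaced frames the two paths must agree to machine precision.
assert np.allclose(msd_frame, msd_time)
```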
Current changes:
- `MSD.results.timeseries` now stores the MSD values and `MSD.results.delta_t_values` stores the Δt values for which time-averaged MSDs are calculated (in the previous commit both were stored in `results.timeseries`). This makes sure the `MSD.results.timeseries` array stores results consistent with the documentation.
- A simple plot can be generated with:

```python
msd = MSD.results.timeseries
ax.plot(MSD.results.delta_t_values, msd)
```

or simply `ax.plot(MSD.results.delta_t_values, MSD.results.timeseries)`.
- No performance hit, and it deals with sliced trajectories.
Future implementation in progress:
The `MSD.results.msds_by_particle` (a 2D array) is documented as "the MSD of each individual particle with respect to lag-time". For uniform dumps this is very straightforward, with each row of the array being a lag-time: each lag (Δt) corresponds to a fixed number of frames, so we can neatly store MSD per particle, per lag-index.
However, in the non-linear (irregular Δt) case, `msds_by_particle` is defined as the time-averaged MSD of each particle computed at the actual Δt values (i.e., time differences between all frame pairs) rather than at fixed frame lags. This accounts for non-uniform sampling and ensures MSD is evaluated based on physical time rather than frame index:
msds_by_particle[Δt_index, i] = ⟨|rᵢ(t + Δt) − rᵢ(t)|²⟩_t
In short, for the `_conclude_non_linear()` method, `MSD.results.msds_by_particle` will be a 2D array with rows being the unique Δt values obtained from `MSD.results.delta_t_values` and columns being the MSD of each particle at that specific Δt. This is also consistent with the documented array structure, ensuring easy plotting of particle-specific MSD over time lags.
I will push once this implementation is done shortly and update you on the performance. Thank you again for your valuable reviews!
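As a rough sketch of that per-particle layout (illustrative only; the function name and signature are not from the PR):

```python
import collections

import numpy as np

def msds_by_particle_non_linear(positions, times):
    """Sketch: per-particle time-averaged MSD binned by the physical
    time difference Delta_t between every frame pair."""
    per_particle = collections.defaultdict(list)
    n_frames = len(times)
    for i in range(n_frames):
        for j in range(i + 1, n_frames):
            dt = times[j] - times[i]
            # squared displacement per particle, shape (n_atoms,)
            sq_disp = np.sum((positions[j] - positions[i]) ** 2, axis=-1)
            per_particle[dt].append(sq_disp)
    delta_ts = np.array(sorted(per_particle))
    # rows: unique Delta_t values, columns: particles
    msds_by_particle = np.array(
        [np.mean(per_particle[dt], axis=0) for dt in delta_ts])
    # averaging over particles recovers the timeseries
    timeseries = msds_by_particle.mean(axis=1)
    return delta_ts, msds_by_particle, timeseries
```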
I have some comments on the code structure.
The biggest thing is to add tests that exercise all your code and check that it produces the results you expect. Have a look at the code coverage. It should be >= 94%.
You'll also need to update the docs and the CHANGELOG.
Hi! Thanks for the feedback. I'm not very familiar with writing tests in Python yet. Could you please guide me or point me to an example from this repo so I can follow the same pattern?
(For now I am running the code locally with an example lammpsdump file every time I update the code to check for results)
I see the codecov/project check is at 93.5%. Can you please let me know how it can be increased? Is it just by adding tests?
Happy to update the CHANGELOG and docs right after!
Testing
Writing tests takes some time to learn. We use pytest as our testing framework, so you may have to keep looking things up there.
See https://userguide.mdanalysis.org/stable/testing.html#testing for more information.
For the MSD code specifically, start (and add your tests) in testsuite/MDAnalysisTests/analysis/test_msd.py.
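For orientation, a minimal test in the pytest style might look like the following (a self-contained sketch with a toy reference function, not an excerpt from test_msd.py):

```python
import numpy as np
import pytest

def naive_msd(positions, times):
    """Tiny reference implementation used only inside this example test."""
    out = {}
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            dt = times[j] - times[i]
            sq = np.mean(np.sum((positions[j] - positions[i]) ** 2, axis=-1))
            out.setdefault(dt, []).append(sq)
    dts = sorted(out)
    return np.array(dts), np.array([np.mean(out[d]) for d in dts])

@pytest.fixture
def ballistic_trajectory():
    # a single particle moving at constant velocity: MSD(dt) = dt**2
    times = np.arange(5, dtype=float)
    positions = np.zeros((5, 1, 3))
    positions[:, 0, 0] = times
    return positions, times

def test_ballistic_msd(ballistic_trajectory):
    positions, times = ballistic_trajectory
    dts, msds = naive_msd(positions, times)
    np.testing.assert_allclose(msds, dts ** 2)
```

The real tests would build a Universe from a small datafile and compare `MSD.results.timeseries` / `MSD.results.delta_t_values` against precomputed reference values in the same way.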
Coverage
For coverage look in the status check box: right now it says codecov/patch – 19.35% of diff, which means less than 20% of the lines of code that you touched are being tested. This number should be >94%. You can click on the status check to see which lines specifically are not covered, and you can also look at the Files changed tab on this PR and see the codecov annotations of missing coverage.
You should add your lammpsdump data file to the testsuite (see develop/testsuite/MDAnalysisTests/datafiles.py). Make it as small as possible and bz2 compress it. Ideally, it should be much less than 1 MB in size. Otherwise, produce a small file for testing. It does not need to produce scientifically solid results, just exercise all the code that you're writing in the same way that "real" data would.
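The trimming/compression step could look roughly like this (file names are made up, and a real dump would be cut at a frame boundary rather than an arbitrary line count):

```shell
# Stand-in for a real trajectory: generate a dummy "full" dump file.
seq 1 10000 > full.lammpsdump
# Keep only the first few hundred lines (a few frames) for the testsuite...
head -n 500 full.lammpsdump > small.lammpsdump
# ...and bz2-compress it; the result should be well under 1 MB.
bzip2 -9 -f small.lammpsdump
ls -lh small.lammpsdump.bz2
```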
I am going to assign myself as PR shepherd but if anyone else wants to take over or be involved, please do!
With the new commit:
- CHANGELOG has been edited
- Docstrings have been edited
- An initial test has been added to test_msd.py
This initial test currently checks whether results.timeseries and results.delta_t_values match the expected values. (NOTE: results.msds_by_particle is yet to be implemented in the original msd.py code inside the new method.)
It seems like some of the automated tests are failing although when I did pytest testsuite/MDAnalysisTests/analysis/test_msd.py locally it passed. I am a little confused about this.
Ran black after each .py file was edited.
It seems from the errors that the filename needs to be added to `__all__` in datafiles.py.
> It seems like some of the automated tests are failing although when I did `pytest testsuite/MDAnalysisTests/analysis/test_msd.py` locally it passed. I am a little confused about this.
Fixed by adding the test file name to `__all__` in datafiles.py.
codecov/patch has increased to 100%.
The linter seems to be giving errors for msd.py although black was run after the edits.
Did you run black with version 24? (Check `black --version` and if necessary pin to 24.)
> Did you run black with version 24? (Check `black --version` and if necessary pin to 24.)
I was running black v25. Downgraded and ran black 24.10.0 with [c90f52e]
* revert black formatting for any files that you didn't touch — _only reformat the files that you've been working on_, please.
Can you please let me know how to revert the changes from the recent commit locally? When I ran black v24.10.0 on msd.py it said "1 file unchanged", so I did `black .`, which changed every .py file.
Thank you for all the suggestions will get to those shortly!
If you look at https://github.com/MDAnalysis/mdanalysis/pull/5066/files you see that 260(!) files were changed. I would look into reverting the commit that did that.
Alternatively, find the files where you definitely added code (msd.py, test_msd.py, CHANGELOG, AUTHORS, ...?) and revert everything else. Look at git revert, git reset (be careful), and also git restore. See also "Reset, restore and revert" in git(1) for the differences between the three commands.
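A self-contained illustration of `git restore` for this kind of situation (toy repo and file names; adapt the paths to the real ones):

```shell
# Build a throwaway repo with two committed files.
mkdir -p restore-demo && cd restore-demo && git init -q
printf 'original\n' > msd.py
printf 'original\n' > other.py
git add . && git -c user.email=you@example.com -c user.name=demo commit -q -m "add files"
# Simulate the situation: intentional edits in one file,
# an accidental reformat in another.
printf 'my real edits\n' > msd.py
printf 'accidental reformat\n' > other.py
# Revert only the accidental change; msd.py keeps your work.
git restore other.py
cat other.py
```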
I added the tests to calculate msds for all dimensions and using the start/stop/step in separate functions inside a new class. Let me know if that's how it should be or if there's any way to do it in fewer lines of code for test_msd.py
The tests look ok to me. The black-formatting is a bit annoying as it takes up so much vertical space but I don't know how to change it and it's not really important so let's leave it.
You're testing all your functionality, so that's great!
Thank you! I will do it shortly. Yes, I also saw the commit suggestion button but was a bit skeptical about using it, as I thought I would have to edit the changes locally once again. Thanks for the clarification.
When you commit a suggestion, it is added to the branch on GitHub. You can then pull in these changes with git pull.
I'm close to implementing results.msds_by_particle; for now, it's just a null array. Once that's done I'll add the corresponding tests to the new class and functions following the same pattern.
> Future implementation in progress: for the `_conclude_non_linear()` method, `MSD.results.msds_by_particle` will be a 2D array with rows being the unique Δt values obtained from `MSD.results.delta_t_values` and columns being the MSD of each particle at that specific Δt. [...]
In the latest commit:
i. msds_by_particle is added in msd.py and some tests are added. The results are huge so I thought to keep it to xyz only.
ii. The code blocks for sorting the delta_t keys and computing the average MSD have been reduced.
iii. In the new method the code can be conceptually separated into blocks: an initialization block, a calculation block, a results/averaging block (following the correct format as per the docs), and a block that actually stores the results in `self.results...`.
Assigning `self.results.timeseries = self.results.msds_by_particle.mean(axis=1)` (as done in the two default methods) yields the same result. This further ensures consistency in shape and in how msds_by_particle is computed and propagated through the algorithm.
The trade-off in execution time between the two methods (simple and non-linear) does not seem that large.
Here is a local performance comparison - test file = CG dump; N=60000; Frames=21
Default method:
Modified method:
There's an internal loop in the interface code that iterates over the dimensions, which also takes some time. Both take about 1 s less without the plots. I presume it handles huge files well.
Your test file is not packaged, see build_testsuite_sdist in https://github.com/MDAnalysis/mdanalysis/actions/runs/16632261268/job/47064413158?pr=5066
You'll need to add a line in testsuite/pyproject.toml after https://github.com/MDAnalysis/mdanalysis/blob/96f0b67f97d23b11c3ea627ec491efe87b3c900a/testsuite/pyproject.toml#L118 like
"data/analysis/msd/*",
so that the files are included.
You can then manually build the dist file – I'd try

```shell
cd testsuite
python -m build
```

(`mamba install python-build` if needed) or you can try what the CI does (`pipx run build --sdist --outdir dist testsuite`). Look at the output to check that your file is included. You can also install the resulting dist file to check locally.
I’ve updated the definitions and am happy to revise them further if a more concise/better version is suggested. Thank you!
When reading the docs I also realized that there's the big warning note about "unwrapping" trajectories https://mdanalysis--5066.org.readthedocs.build/en/5066/documentation_pages/analysis/msd.html#computing-an-msd but that's outdated. I'll change it in this PR. You might just have to git pull before pushing.
`results.delta_t_values` is now available for both the simple and FFT-based methods. For a linearly dumped file, `dump_times[1] - dump_times[0]` is just the actual time difference between two frames, which is a constant! Previously the docs instructed defining it via `timestep=1`.
Thank you for the review.
Thanks for the updates. I only have minor cosmetic changes.
While reviewing, I also realized that your code can solve a problem that has been inherent in the existing code: the `run()` method can take the `frames=[0, 20, 25, 100, 1001, ...]` kwarg to select arbitrary frames instead of start/stop/step, and in that case the "linear" approach is incorrect (but that case is not being caught). We should select your new code path for uneven frame separations. I am not asking for this to be resolved in this PR. If you agree that this is a problem and that your code could solve it, could you raise an issue for it so that it's not forgotten?
That is a very good observation, and yes, it is wrong to use the default method for such non-linearly spaced kwargs. I will surely raise an issue, and I think just diverting the code path to the new method when such kwargs are present would be a good solution.
A test can also be added and later modifications can be performed to work the issue out.
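A dispatch check along these lines could catch uneven frame selections automatically (a hypothetical helper, not code from this PR):

```python
import numpy as np

def frames_evenly_spaced(times, rtol=1e-8):
    """Return True if consecutive time differences are all (nearly) equal,
    i.e. the fast linear/FFT path is valid; otherwise the non-linear
    time-averaged path should be selected."""
    diffs = np.diff(np.asarray(times, dtype=float))
    if len(diffs) == 0:
        return True  # zero or one frame: nothing to compare
    return bool(np.allclose(diffs, diffs[0], rtol=rtol))
```

For example, `frames_evenly_spaced([0, 20, 25, 100, 1001])` is False, so the `frames=[...]` case would be routed to the non-linear method.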
I see that the two codecov tests haven't run. Is there any particular reason? Just curious