
Feature: non-linear time-averaged MSD

Open gitsirsha opened this issue 9 months ago • 1 comment

Fixes #5028

Changes made in this Pull Request:

  • The msd.py script was modified to support time-averaged mean square displacement calculation for dump files with non-linear time spacing.

PR Checklist

  • [ ] Issue raised/referenced? Yes Issue #5028
  • [ ] Tests updated/added? Not Yet
  • [ ] Documentation updated/added? Updated short documentation at the top of the msd.py as comments
  • [ ] package/CHANGELOG file updated? Not Yet
  • [ ] Is your name in package/AUTHORS? (If it is not, add it!) Yes - Added

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--5066.org.readthedocs.build/en/5066/

gitsirsha avatar Jun 15 '25 18:06 gitsirsha

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 93.87%. Comparing base (91146c5) to head (29df80e).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #5066   +/-   ##
========================================
  Coverage    93.86%   93.87%           
========================================
  Files          179      179           
  Lines        22217    22252   +35     
  Branches      3157     3161    +4     
========================================
+ Hits         20853    20888   +35     
  Misses         902      902           
  Partials       462      462           

codecov[bot] avatar Jun 15 '25 18:06 codecov[bot]

Thank you @IAlibay, @orbeckst, and @mrshirts for taking the time to review the initial PR. Based on your suggestions I have made the following changes:

  1. Added a keyword argument non_linear instead of the is_linear check from the initial commit. non_linear defaults to False and is only set to True by the user to invoke the new method _conclude_non_linear. Keeping the default route unchanged lets us check whether the GH Actions CI tests still pass.

    The new method only executes when fft=False and non_linear=True.

  2. Removed the documentation comments at the top. I will move that material into the docstrings once we are satisfied with the code.

  3. Removed the print statements.

  4. Moved import collections to the top of the file.

  5. Deleted positions = np.zeros((n_frames, n_atoms, 3)) as it was not necessary.

Necessity of a new method:

  • The _conclude_simple method calculates the time-averaged MSD based on the number of frames in a dump file, not on the timestep at which each frame was dumped. A better method is therefore needed to perform "time-averaging" rather than "frame-averaging".

  • Moreover, if the new method _conclude_non_linear is used on a linearly dumped file, it generates the same MSD values as the default _conclude_simple method. This is because the new method relies solely on the timestamp of each frame, which for a linearly dumped file is simply proportional to the frame index.

Algorithm of _conclude_non_linear to calculate the time-averaged MSD:

  • It creates a dictionary (hash map) whose keys are the time differences between frames, Δt, and whose values are lists of the frame-to-frame MSD values for that particular Δt: {Δt: [msd1, msd2, ...]}.

  • The reference frame changes with each iteration; each time a new key (i.e., Δt) is found, it is added to the dictionary.

  • If a duplicate Δt is found, the MSD is appended to the value list for that specific key.

  • Finally, for each Δt in the dictionary, the MSD values in the list are averaged.
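
The steps above can be sketched in a few lines of NumPy. This is an illustrative reimplementation of the idea, not the actual _conclude_non_linear code from the PR:

```python
import collections
import numpy as np

def time_averaged_msd(positions, times):
    """Time-averaged MSD binned by the actual time lag between frames.

    positions: (n_frames, n_atoms, 3) unwrapped coordinates
    times:     (n_frames,) dump times, possibly non-linearly spaced
    """
    msd_by_dt = collections.defaultdict(list)
    n_frames = len(times)
    # Every frame serves as a reference frame in turn.
    for i in range(n_frames):
        for j in range(i + 1, n_frames):
            dt = times[j] - times[i]
            disp = positions[j] - positions[i]
            # Particle-averaged squared displacement for this frame pair.
            msd_by_dt[dt].append(np.mean(np.sum(disp**2, axis=-1)))
    # Average the MSD list collected for each unique Δt.
    delta_t = np.array(sorted(msd_by_dt))
    msd = np.array([np.mean(msd_by_dt[dt]) for dt in delta_t])
    return delta_t, msd
```

For evenly spaced times this reduces to the usual frame-lag averaging, which is why the linear case can serve as a consistency check.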

Example Performance comparison:

A coarse-grained LAMMPS trajectory with N = 60,000 particles and 21 frames (non-linearly spaced timesteps) was run with FFT turned off:

The new method (toggled on with non_linear=True) took around 2.9 s; the default method took around 1.6 s.

(This does not include the plotting time in the user interface.)

[Plots: output_default, output_new]

gitsirsha avatar Jun 30 '25 22:06 gitsirsha

The big picture here is that for polymer MSDs, the results are highly nonlinear in time (unlike liquids, where the MSD becomes linear very quickly); in fact, they generally have a power-law dependence. Calculating the MSD at long lag times ends up requiring a huge amount of data: you get the short-time behavior very accurately and the long-time behavior less accurately, but you don't really need the short-time behavior more accurately than the long-time behavior.

So LAMMPS has a standard option to output trajectory frames logarithmically within every block of N steps, with spacing (roughly) 1, 10, 100, 1000. So for every N steps, you are only outputting 4 points. Then if your simulation is 100·N steps long, you will get 100 data points for lags of 1, 10, 100, 1000, etc., and you get equal accuracy for all lags, at a cost of some precision at short times, but using far less storage space (polymer simulations can have a lot of particles!). Polymer scientists use this a lot.
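
This dump scheme can be sketched as follows; the block length N, the offsets, and the number of blocks are made-up values, purely for illustration:

```python
import numpy as np

# Log-spaced dump offsets within each block of N steps (illustrative).
N = 1000                        # block length in steps
offsets = [1, 10, 100, 1000]    # roughly logarithmic spacing
n_blocks = 100                  # total simulation = n_blocks * N steps

dump_steps = sorted(
    block * N + off for block in range(n_blocks) for off in offsets
)
# Only 4 frames per block are stored, yet every lag between the
# log-spaced offsets recurs once per block, so all lags get
# comparable statistics.
```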

The issue is that the standard code can't handle this, and I'm actually not sure whether it can be handled with FFTs. So I'm not totally sure it makes sense to have a single code path (although that would be desirable). What one should require is that LINEAR data gives the same answer (within machine precision) through either path, which can be checked automatically, so that at least any code duplication is validated.

mrshirts avatar Jun 30 '25 22:06 mrshirts

Current changes:

- MSD.results.timeseries now stores the MSD values, and MSD.results.delta_t_values stores the Δt values for which the time-averaged MSDs are calculated (in the previous commit both were stored in results.timeseries). This ensures that the MSD.results.timeseries array stores results consistent with the documentation.

- A simple plot can be generated with msd = MSD.results.timeseries; ax.plot(MSD.results.delta_t_values, msd)

or simply ax.plot(MSD.results.delta_t_values, MSD.results.timeseries)

  • No performance hit, and it handles sliced trajectories
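
A self-contained version of such a plot could look like this; the arrays below are stand-ins for MSD.results.delta_t_values and MSD.results.timeseries from an analysed MSD object:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this standalone sketch
import matplotlib.pyplot as plt

# Stand-in data; in practice these come from MSD.results after run().
delta_t_values = np.array([1.0, 9.0, 90.0, 900.0])
timeseries = 6.0 * delta_t_values  # placeholder diffusive values only

fig, ax = plt.subplots()
ax.plot(delta_t_values, timeseries, marker="o")
ax.set_xscale("log")  # log axes suit log-spaced lags
ax.set_yscale("log")
ax.set_xlabel(r"$\Delta t$")
ax.set_ylabel("MSD")
fig.savefig("msd_vs_delta_t.png")
```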

Future implementation in progress: MSD.results.msds_by_particle (a 2D array) is documented as "The MSD of each individual particle with respect to lag-time". For uniform dumps this is very straightforward, with each row of the array corresponding to a lag-time: each lag (Δt) covers a fixed number of frames, so we can neatly store the MSD per particle, per lag index.

However, in the non-linear (irregular Δt) case, msds_by_particle is defined as the time-averaged MSD of each particle computed at the actual Δt values (i.e., the time differences between all frame pairs) rather than at fixed frame lags. This accounts for non-uniform sampling and ensures the MSD is evaluated in physical time rather than by frame index.

msds_by_particle[Δt_index, i] = ⟨|rᵢ(t + Δt) − rᵢ(t)|²⟩_t

In short, for the _conclude_non_linear() method, MSD.results.msds_by_particle will be a 2D array whose rows correspond to the unique Δt values in MSD.results.delta_t_values and whose columns hold the MSD of each particle at that specific Δt. This is also consistent with the documented array structure, making it easy to plot particle-specific MSDs over time lags.
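
An illustrative sketch of this per-particle layout (again not the PR's actual implementation) could look like:

```python
import collections
import numpy as np

def per_particle_msd(positions, times):
    """Per-particle time-averaged MSD binned by the actual lag Δt.

    Returns (delta_t, msds_by_particle), where row k of
    msds_by_particle holds the MSD of every particle at delta_t[k].
    """
    acc = collections.defaultdict(list)
    n_frames = len(times)
    for i in range(n_frames):
        for j in range(i + 1, n_frames):
            dt = times[j] - times[i]
            disp = positions[j] - positions[i]
            acc[dt].append(np.sum(disp**2, axis=-1))  # shape (n_atoms,)
    delta_t = np.array(sorted(acc))
    msds_by_particle = np.array(
        [np.mean(acc[dt], axis=0) for dt in delta_t]
    )
    return delta_t, msds_by_particle
```

Averaging over the particle axis with msds_by_particle.mean(axis=1) then recovers the overall timeseries.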

I will push shortly once this implementation is done and update you on the performance. Thank you again for your valuable reviews!

gitsirsha avatar Jul 14 '25 19:07 gitsirsha

I have some comments on the code structure.

The biggest thing is to add tests that exercise all your code and check that it produces the results you expect. Have a look at the code coverage. It should be >= 94%.

You'll also need to update the docs and the CHANGELOG.

Hi! Thanks for the feedback. I'm not very familiar with writing tests in Python yet. Could you please guide me or point me to an example from this repo so I can follow the same pattern? (For now I am running the code locally with an example lammpsdump file every time I update the code, to check the results.)

I see that codecov/project is at 93.5%. Can you please let me know how it can be increased? Is it just by adding tests?

Happy to update the CHANGELOG and docs right after!

gitsirsha avatar Jul 15 '25 16:07 gitsirsha

Testing

Writing tests takes some time to learn. We use pytest as our testing framework, so you may have to keep looking things up there.

See https://userguide.mdanalysis.org/stable/testing.html#testing for more information.

For the MSD code specifically, start (and add your tests) in testsuite/MDAnalysisTests/analysis/test_msd.py.
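
Such a test typically builds a small system with a known answer and compares the computed values against an analytical reference. A self-contained toy version of that pattern (helper names invented; real tests in test_msd.py use MDAnalysis universes and fixtures) might look like:

```python
import numpy as np

def make_linear_walker(n_frames=5, speed=2.0):
    """One particle moving at constant speed along x (toy data)."""
    pos = np.zeros((n_frames, 1, 3))
    pos[:, 0, 0] = speed * np.arange(n_frames)
    return pos

def test_constant_velocity_msd():
    pos = make_linear_walker()
    lags = np.arange(1, 5)
    # Frame-lag MSD for evenly spaced frames.
    msd = [np.mean(np.sum((pos[d:] - pos[:-d]) ** 2, axis=-1)) for d in lags]
    # Ballistic motion: MSD(dt) = (speed * dt)**2
    np.testing.assert_allclose(msd, (2.0 * lags) ** 2)
```

pytest discovers any function named test_* and reports a failure if an assertion raises.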

Coverage

For coverage, look in the status check box: right now it says codecov/patch – 19.35% of diff, which means less than 20% of the lines of code that you touched are being tested. This number should be > 94%. You can click on the status check to see which lines specifically are not covered, and you can also look at the Files changed tab on this PR to see the codecov annotations of missing coverage.

orbeckst avatar Jul 15 '25 19:07 orbeckst

You should add your lammpsdump data file to the testsuite (see develop/testsuite/MDAnalysisTests/datafiles.py). Make it as small as possible and bz2 compress it. Ideally, it should be much less than 1 MB in size. Otherwise, produce a small file for testing. It does not need to produce scientifically solid results, just exercise all the code that you're writing in the same way that "real" data would.

orbeckst avatar Jul 15 '25 19:07 orbeckst

I am going to assign myself as PR shepherd but if anyone else wants to take over or be involved, please do!

orbeckst avatar Jul 18 '25 20:07 orbeckst

With the new commit:

  • CHANGELOG has been edited
  • Docstrings have been edited
  • An initial test has been added to test_msd.py

This initial test currently checks whether results.timeseries and results.delta_t_values match the expected values. (NOTE: results.msds_by_particle is yet to be implemented in the original msd.py code inside the new method.)

It seems like some of the automated tests are failing, although when I ran pytest testsuite/MDAnalysisTests/analysis/test_msd.py locally it passed. I am a little confused about this.

Ran black after each .py file was edited.

gitsirsha avatar Jul 22 '25 16:07 gitsirsha

It seems from the errors that the filename needs to be added to __all__ in datafiles.py.

gitsirsha avatar Jul 22 '25 17:07 gitsirsha

It seems like some of the automated tests are failing, although when I ran pytest testsuite/MDAnalysisTests/analysis/test_msd.py locally it passed. I am a little confused about this.

Fixed by adding the test file name to __all__ in datafiles.py. codecov/patch has increased to 100%.

The linter seems to be giving errors for msd.py, although black was run after the edits.

gitsirsha avatar Jul 22 '25 18:07 gitsirsha

Did you run black with version 24? (Check black --version and if necessary pin to 24)

orbeckst avatar Jul 22 '25 19:07 orbeckst

Did you run black with version 24? (Check black --version and if necessary pin to 24)

I was running black v25. Downgraded and ran black 24.10.0 with [c90f52e]

gitsirsha avatar Jul 22 '25 19:07 gitsirsha

* revert black formatting for any files that you didn't touch — _only reformat the files that you've been working on_, please.

Can you please let me know how to revert the changes from the recent commit locally? When I ran black v24.10.0 on msd.py it said 1 file unchanged, so I ran black ., which changed every .py file.

Thank you for all the suggestions will get to those shortly!

gitsirsha avatar Jul 22 '25 20:07 gitsirsha

If you look at https://github.com/MDAnalysis/mdanalysis/pull/5066/files you see that 260(!) files were changed. I would look into reverting the commit that did that.

Alternatively, find the files where you definitely added code (msd.py, test_msd.py, CHANGELOG, AUTHORS, ...?) and revert everything else. Look at git revert, git reset (be careful), and also git restore. See "Reset, restore and revert" in git(1) for the differences between the three commands.

orbeckst avatar Jul 22 '25 20:07 orbeckst

I added the tests that calculate MSDs for all dimensions and that use start/stop/step, as separate functions inside a new class. Let me know if that's how it should be, or if there's a way to do it in fewer lines of code in test_msd.py.

gitsirsha avatar Jul 28 '25 20:07 gitsirsha

The tests look ok to me. The black-formatting is a bit annoying as it takes up so much vertical space but I don't know how to change it and it's not really important so let's leave it.

You're testing all your functionality, so that's great!

orbeckst avatar Jul 28 '25 20:07 orbeckst

Thank you! I will do it shortly. Yes, I also saw the commit-suggestion button but was a bit skeptical about using it, as I thought I would have to edit the changes locally once again. Thanks for the clarification.

gitsirsha avatar Jul 28 '25 20:07 gitsirsha

When you commit a suggestion, it is added to the branch on GitHub. You can then pull in these changes with git pull.

orbeckst avatar Jul 28 '25 20:07 orbeckst

I'm close to implementing results.msds_by_particle; for now, it's just a null array. Once that's done I'll add the corresponding tests to the new class, with functions following the same pattern.

gitsirsha avatar Jul 29 '25 00:07 gitsirsha

In the latest commit:

i. msds_by_particle is added in msd.py and some tests are added. The results are huge, so I thought to keep them to xyz only.

ii. The code blocks for sorting the Δt keys and computing the average MSD have been reduced. [screenshot]

iii. In the new method the code can be conceptually separated into blocks: an initialization block, a calculation block, a results/averaging block (following the correct format as per the docs), and a block that actually stores the results in self.results. [screenshot] [screenshot]

gitsirsha avatar Jul 29 '25 02:07 gitsirsha

Assigning self.results.timeseries = self.results.msds_by_particle.mean(axis=1) (as done in the two default methods) yields the same result. This further ensures consistency in the shape of msds_by_particle and in how it is computed and propagated through the algorithm.

gitsirsha avatar Jul 29 '25 02:07 gitsirsha

The trade-off in execution time between the two methods (simple and non-linear) does not seem large.

Here is a local performance comparison (test file = CG dump; N = 60,000; frames = 21):

Default method: [screenshot] Modified method: [screenshot]

There's an internal loop in the interface code that iterates over the dimensions, which also takes some time. Both take about 1 s less without the plots. I presume it will handle huge files well.

gitsirsha avatar Jul 29 '25 02:07 gitsirsha

Your test file is not packaged; see build_testsuite_sdist in https://github.com/MDAnalysis/mdanalysis/actions/runs/16632261268/job/47064413158?pr=5066

You'll need to add a line in testsuite/pyproject.toml after https://github.com/MDAnalysis/mdanalysis/blob/96f0b67f97d23b11c3ea627ec491efe87b3c900a/testsuite/pyproject.toml#L118 like

"data/analysis/msd/*", 

so that the files are included.

You can then manually build the dist file – I'd try

cd testsuite
python -m build

(mamba install python-build if needed) or you can try what the CI does (pipx run build --sdist --outdir dist testsuite). Look at the output to check that your file is included. You can also install the resulting dist file to check locally.

orbeckst avatar Jul 31 '25 17:07 orbeckst

I’ve updated the definitions and am happy to revise them further if a more concise/better version is suggested. Thank you!

gitsirsha avatar Aug 01 '25 14:08 gitsirsha

When reading the docs I also realized that there's the big warning note about "unwrapping" trajectories https://mdanalysis--5066.org.readthedocs.build/en/5066/documentation_pages/analysis/msd.html#computing-an-msd but that's outdated. I'll change it in this PR. You might just have to git pull before pushing.

orbeckst avatar Aug 01 '25 16:08 orbeckst

results.delta_t_values is now available for both the simple and FFT-based methods. For a linearly dumped file, dump_times[1] - dump_times[0] is nothing but the actual time difference between two frames, which is a constant; previously the docs instructed users to define it via timestep=1.

gitsirsha avatar Aug 15 '25 17:08 gitsirsha

Thank you for the review.

Thanks for the updates. I only have minor cosmetic changes.

While reviewing, I also realized that your code can solve a problem that has been inherent in the existing code: the run() method can take the frames=[0, 20, 25, 100, 1001, ...] kwarg to select arbitrary frames instead of start/stop/step, and in that case the "linear" approach is incorrect (but that case is not being caught). We should select your new code path for uneven frame separations.

I am not asking for this to be resolved in this PR. If you agree that this is a problem and that your code could solve it, could you raise an issue for it so that it's not forgotten?

That is a very good observation, and yes, it is wrong to use the default method for such non-linearly spaced kwargs. I will surely raise an issue, and I think simply diverting the code path to the new method when such kwargs are present would be a good solution.
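
Such a divert could hinge on a small spacing check; a sketch (the helper name needs_non_linear is hypothetical, not part of the PR):

```python
import numpy as np

def needs_non_linear(times):
    """Return True if the dump/frame times are not evenly spaced."""
    dt = np.diff(np.asarray(times, dtype=float))
    # A single spacing value (within tolerance) means the linear
    # frame-lag approach is safe; anything else should take the
    # non-linear path.
    return bool(dt.size > 0 and not np.allclose(dt, dt[0]))
```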

gitsirsha avatar Aug 16 '25 19:08 gitsirsha

A test can also be added, and later modifications can be performed to work out the issue.

gitsirsha avatar Aug 16 '25 19:08 gitsirsha

I see that the two codecov tests haven't run. Is there any particular reason? Just curious

gitsirsha avatar Aug 16 '25 23:08 gitsirsha