mdanalysis Parallelizes `MDAnalysis.analysis.InterRDF` and `MDAnalysis.analysis.InterRDF

Fixes #4675

Changes made in this Pull Request:

Parallized both rdf.InterRDF and rdf.InterRDF_s

TLDR: of the comments below: Initially I thought rdf isnt parallizable but turns out both classes in rdf can be parallelized.

PR Checklist

[x] Tests?
[x] Docs?
[x] CHANGELOG updated?
[x] Issue raised/referenced?

Developers certificate of origin

[x] I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.

📚 Documentation preview 📚: https://mdanalysis--4884.org.readthedocs.build/en/4884/

Jan 07 '25 19:01 tanishy7777

Hello @tanishy7777! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file package/MDAnalysis/analysis/rdf.py:

Line 592:1: W293 blank line contains whitespace

In the file testsuite/MDAnalysisTests/analysis/test_rdf.py:

Line 160:1: E302 expected 2 blank lines, found 1

In the file testsuite/MDAnalysisTests/analysis/test_rdf_s.py:

Line 179:1: E302 expected 2 blank lines, found 1

Comment last updated at 2025-01-20 20:35:32 UTC

Jan 07 '25 19:01 pep8speaks

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 93.89%. Comparing base (519ac56) to head (79601e6). :warning: Report is 27 commits behind head on develop.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #4884   +/-   ##
========================================
  Coverage    93.88%   93.89%           
========================================
  Files          180      180           
  Lines        22422    22447   +25     
  Branches      3186     3188    +2     
========================================
+ Hits         21052    21077   +25     
  Misses         902      902           
  Partials       468      468

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Jan 07 '25 19:01 codecov[bot]

@tanishy7777 thanks for the prompt PR! I have some comments though:

self.volume_cum is cummulated across the frames so we cant parallize simply using the split-apply-combine technique.

Can we just sum it up separately though? I mean, make it a part of self.results for each worker's trajectory, and in _conclude write sum of it to self.volume_cum.

Jan 08 '25 12:01 marinegor

Can we just sum it up separately though? I mean, make it a part of self.results for each worker's trajectory, and in _conclude write sum of it to self.volume_cum.

Got it! I have made analysis.rdf.InterRDF parallizable with this approach but analysis.rdf.InterRDF_s needs a bit more work.

Jan 08 '25 17:01 tanishy7777

when trying to make analysis.rdf.InterRDF_s parallizable I am running into an error with aggregating results.count.

FAILED testsuite\MDAnalysisTests\analysis\test_rdf_s.py::test_nbins[client_InterRDF_s1] - ValueError: setting an array element with a 
sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous p...

I tried to find out why this is happening by looking at the result itself which is being aggregated along 'count'

so the arrs that is passed to the ResultsGroup.ndarray_vstack function as shown below can be broken into arrs = [arr, arr1] (refere imgs)

so I tried finding the dimensions of the arrays manually because it wasnt converting to a numpy array.

so its not able to convert to a numpy array, because of inconsistent dimensions. I am not sure how to resolve this

here arr is basically made up of of 2 arrays of (1,2,412) and (2,2,412) and arr1 is also made up of of 2 arrays of (1,2,412) and (2,2,412)

Jan 08 '25 19:01 tanishy7777

based on the above comment https://github.com/MDAnalysis/mdanalysis/pull/4884#issuecomment-2578422080

I think we can mark analysis.rdf.InterRDF_s as non parallizable and mark analysis.rdf.InterRDF as parallizable

because its not possible to convert array of inhomogenous dimensions to a numpy array and since rdf.InterRDF_s needs numpy arrays to be passed. so even if we concatenate the arrays as normal lists performing even basic operations would be hard.

Because after the _get_aggregator method when _conclude is run operations like

self.results.count[i] / norm

wont be possible as division is not supported between list and int

Jan 08 '25 19:01 tanishy7777

@tanishy7777 the class should be marked as non-parallelizable only if the algorithm to run it is not actually parallelizable, which I'm not yet convinced is the case for all the mentioned classes.

But I think you're on the right path here, you just need to implement a custom aggregation function, instead of those implemented among ResultsGroup staticmethods -- they obviously don't cover all the possible cases for aggregation, just the basic ones for user's convenience.

Can you describe what kind of arrays you're trying to aggregate and can not find an appropriate function for? I didn't quite get it from your screenshots.

Jan 09 '25 22:01 marinegor

But I think you're on the right path here, you just need to implement a custom aggregation function, instead of those implemented among ResultsGroup staticmethods -- they obviously don't cover all the possible cases for aggregation, just the basic ones for user's convenience.

Can you describe what kind of arrays you're trying to aggregate and can not find an appropriate function for? I didn't quite get it from your screenshots.

So, the array is something like this

The [412] denotes an array with 412 elements.

That is why it cant be processed by numpy directly. But I think if I modify it, using a custom comparator like you mentioned I can sum the entries(by adding the 3 blue arrays of shape 2x412 as in the picture) and convert it to a 1x2x412 array I think? I am not sure about the final dimension it needs to be converted to.

@marinegor

Jan 13 '25 20:01 tanishy7777

I am not sure about the final dimension it needs to be converted to

it should be the same as if you'd run it without parallelization. And I assume you want to stack/sum/whatever along the dimension that corresponds to the timestep -- you can probably guess which one it is if you run it on some example with known number of frames. Example trajectories you can find in MDAnalysisTests.

Jan 13 '25 20:01 marinegor

it should be the same as if you'd run it without parallelization. And I assume you want to stack/sum/whatever along the dimension that corresponds to the timestep -- you can probably guess which one it is if you run it on some example with known number of frames. Example trajectories you can find in MDAnalysisTests.

Got it. Will work on that!

Jan 13 '25 20:01 tanishy7777

it should be the same as if you'd run it without parallelization. And I assume you want to stack/sum/whatever along the dimension that corresponds to the timestep -- you can probably guess which one it is if you run it on some example with known number of frames. Example trajectories you can find in MDAnalysisTests

I tried making the custom_aggregator but some tests are still failing

These 2 lines specifically are causing the errors in all the 12 tests assert_allclose(max(rdf.results.rdf[0][0][0]), value) assert rdf.results.edges[0] == rmin

I am not sure how to resolve this, I will dig through the docs a bit more. Will update if I can figure it out.

Jan 18 '25 11:01 tanishy7777

used a custom aggregator for results.counts and converted the aggregation method for results.edges and results.bins to ResultsGroup.ndarray_mean

TLDR: Parallized both InterRDF and InterRDF_s classes in rdf.py

Thanks @marinegor for your help!

Jan 18 '25 16:01 tanishy7777

@orbeckst @RMeli @marinegor I think this is ready to be merged, can you please review it?

Jan 21 '25 12:01 tanishy7777

hi @tanishy7777, sorry for the long review (again).

good work, thanks for your contribution! I've added my comments, main action points below:

move your custom aggregation function from the class to a standalone function, name appropriately and test

revert changes to test_xds.py and core/selection.py (I'm guessing they were introduced by black or smth)

make sure you don't need to track self.volume_cum, since you're tracking self.results.volume_cum and assigning self.volume_cum in _conclude. I made suggestions regarding that but might have missed something; please make sure until _conclude only self.results.volume_cum is used.

Sorry for the late reply. I will get to work on these changes! I had semester examinations so was quite busy the last 2 weeks.

Mar 14 '25 20:03 tanishy7777

@marinegor Sorry for the delayed action on this PR again, college has been quite hectic.

I have added the changes that you had requested. For the formatting issues caused by black, I merged develop into my branch so I think it should be resolved.

Please let me know if any further changes are needed. Thanks for reviewing my changes and for your guidance!

Apr 01 '25 12:04 tanishy7777

@tanishy7777 thanks for all your contributions. As you know, reviewer time is pretty precious. I asked @marinegor to look at PR #4896 already so please be understanding if this one may take longer.

Apr 12 '25 20:04 orbeckst

@tanishy7777 thanks for all your contributions. As you know, reviewer time is pretty precious. I asked @marinegor to look at PR #4896 already so please be understanding if this one may take longer.

Thanks for you reply. Completely understandable, no worries at all.

Apr 12 '25 21:04 tanishy7777

@tanishy7777 just as a heads-up: Egor told us he has very little availability, so we'll have to find other reviewers to continue here.

As the first step where you can help is to just resolve the conflict to get the PR back into mergeable state.

May 09 '25 16:05 orbeckst

@p-j-smith would you be in a position to review and look after this parallelization PR? Egor has already done a lot of deep reviewing here but as he can't do it at the moment and we'd like to move this improvement along, it would be great to have your 👀 on it.

I am tentatively assigning you but if you can't do it, please un-assign yourself and let me know with a comment + ping and then I'll be looking for someone else. Thanks!!

May 09 '25 16:05 orbeckst

Thank you @tanishy7777 for the work and 🎉 ! Thank you @marinegor for the initial reviews and thank you @p-j-smith for getting it merged!

Oct 08 '25 21:10 orbeckst

While looking at the RDF PR #5073 I realized that some of the CHANGELOG and versionchanged entries on this PR were outdated. I fixed them as part of finalizing the other PR in https://github.com/gitzhangch/mdanalysis/commit/477b0e2efc8271f70b6b2fd318b8033e46bd5ee5

Oct 09 '25 00:10 orbeckst

some of the CHANGELOG and versionchanged entries on this PR were outdated

Sorry for missing that, and thanks for fixing!

Oct 09 '25 05:10 p-j-smith

No problem :-)

Oct 09 '25 19:10 orbeckst

mdanalysis
mdanalysis copied to clipboard

Parallelizes `MDAnalysis.analysis.InterRDF` and `MDAnalysis.analysis.InterRDF_s`

PR Checklist

Developers certificate of origin

Comment last updated at 2025-01-20 20:35:32 UTC

Codecov Report

mdanalysis mdanalysis copied to clipboard

Parallelizes `MDAnalysis.analysis.InterRDF` and `MDAnalysis.analysis.InterRDF_s`

PR Checklist

Developers certificate of origin

Comment last updated at 2025-01-20 20:35:32 UTC

Codecov Report

mdanalysis
mdanalysis copied to clipboard