mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

Enabling of parallelization of `analysis.hydrogenbonds.WaterBridgeAnalysis`

Open talagayev opened this issue 3 months ago • 17 comments

Fixes #4666.

Changes made in this Pull Request:

  • Addition of parallelization to WaterBridgeAnalysis class in analysis.hydrogenbonds.wbridge_analysis.
  • Addition of client_WaterBridgeAnalysis to conftest.py.
  • Adjustment of tests in test_wbridge.py to work with parallelization.

Here currently I adjusted analysis.hydrogenbonds.wbridge_analysis to work with parallelization and added an example of pytests that will show that it works.

The problem that I have encountered is that I had to modify universe_DA_PBC, since if I directly add the client_WaterBridgeAnalysis it will led to an infinite pytest on my local PC, mainly connected with us using here StringIO that appearantly has difficulties with parallelization.

The second thing is that currently the tests in test_wbridge.py are mainly with 1 frame in mind, so to showcase that parallelization works I could add additional tests similar like universe_DA_PBC_10frames().

PR Checklist

  • [x] Issue raised/referenced?
  • [ ] Tests updated/added?
  • [ ] Documentation updated/added?
  • [ ] package/CHANGELOG file updated?
  • [x] Is your name in package/AUTHORS? (If it is not, add it!)

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--5151.org.readthedocs.build/en/5151/

talagayev avatar Nov 20 '25 12:11 talagayev

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 92.72%. Comparing base (6b51eac) to head (3b4812f). :warning: Report is 1 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #5151   +/-   ##
========================================
  Coverage    92.72%   92.72%           
========================================
  Files          180      180           
  Lines        22458    22466    +8     
  Branches      3186     3187    +1     
========================================
+ Hits         20824    20832    +8     
  Misses        1177     1177           
  Partials       457      457           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Nov 20 '25 12:11 codecov[bot]

@marinegor I would ping you in this PR, since it is parallelization :)

Here I would need your opinion on how to approach the pytest cases. I didn't add client_WaterBridgeAnalysis to the tests with StringIO since these would lead to the tests being infinite. The question is, is there any goaround for parallelization with StringIO or do we just modify the pytests to work?

Also as mentioned with the pytests being only for 1 frame, should I add for each test cases with multiple frames and test them as in the example that I have now or what would be the prefereable approach?

talagayev avatar Nov 20 '25 13:11 talagayev

@talagayev I don't see a reason why these files take StringIO instead of just a filename, hence I'd suggest re-writing these tests to accept filenames, and put respective filenames (ADA.gro, DWA.gro, ...) in e.g. testsuite/MDAnalysisTests/data or in a new folder testsuite/MDAnalysisTests/data/waterbridge. Then everything should work as for other classes, I guess.

marinegor avatar Nov 20 '25 19:11 marinegor

@talagayev I don't see a reason why these files take StringIO instead of just a filename, hence I'd suggest re-writing these tests to accept filenames, and put respective filenames (ADA.gro, DWA.gro, ...) in e.g. testsuite/MDAnalysisTests/data or in a new folder testsuite/MDAnalysisTests/data/waterbridge. Then everything should work as for other classes, I guess.

Agree, we can do them similar to the other cases with the files. I will create the test filles, put them into testsuite/MDAnalysisTests/data/waterbridge and adjust the pytests to use them. I will then do for now for the tests the case where I check for now for for each case a test with 1 frame, basically as they are now and a test for multiple frames, so similar like the added test_donor_accepter_pbc_multi, just name them more correspondingly to the tests with 1 frame.

talagayev avatar Nov 21 '25 00:11 talagayev

@talagayev ok, sure! just @ me here or request a review when you're done.

marinegor avatar Nov 26 '25 15:11 marinegor

@marinegor okay should be ready to be reviewed now, had to do adjust timesteps for parallelization cases, due to an error that was appearing, mainly that one that was appearing

analysis_func = <function TestWaterBridgeAnalysis.test_count_by_time_weight.<locals>.analysis at 0x000002322B345A80>, kwargs = {}, result = []

    def count_by_time(self, analysis_func=None, **kwargs):
        """Counts the number of water bridges per timestep.

        The counting behaviour can be adjusted by supplying analysis_func.
        See :ref:`wb_count_by_time` for details.

        Returns
        -------
        counts : list
             Returns a time series ``N(t)`` where ``N`` is the total
             number of observed water bridges at time ``t``.

        """
        if analysis_func is None:
            analysis_func = self._count_by_time_analysis
        if self.results.network:
            result = []
>           for time, frame in zip(self.timesteps, self.results.network):
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E           TypeError: 'NoneType' object is not iterable

..\..\..\package\MDAnalysis\analysis\hydrogenbonds\wbridge_analysis.py:1966: TypeError

So had to adjust timesteps to not be NoneType during parallelization.

As for the testfiles, I did create those that were created in the tests, put them in the data and import them similar as in the other tests.

talagayev avatar Nov 26 '25 19:11 talagayev

@marinegor Ah and I can update the CHANGELOG and add the Documentation if this looks good, wasn't sure if something needs to be modified so I kept it for last :)

talagayev avatar Nov 26 '25 23:11 talagayev

@talagayev

I'm slightly worried about these timesteps -- shouldn't they be actual timestamps, and not just consecutive integers?

I.e. if I modify lines 1954-ish it to:

        if self.results.network:
            result = []
            print(f'{self.timesteps=}')
            assert False
            for time, frame in zip(self.timesteps, self.results.network):

and run tests, I get [0.0] before assertion error.

I guess the issue here is that attribute timesteps are implicitly expected to exist after the run is finished, which is not the case when we run it in parallel. Hence they should be moved to self.results.timesteps if we rely on them in _conclude or elsewhere.

It's weirdly the only class that actually does this:

❯ rg -t py self.timesteps
package/MDAnalysis/analysis/hydrogenbonds/wbridge_analysis.py
1027:        self.timesteps = None  # time for each frame
1314:        self.timesteps = []
1409:        self.timesteps.append(self._ts.time)
1969:        if self.timesteps is None:
1972:            timesteps = self.timesteps
2036:            if self.timesteps is None:
2039:                timesteps = self.timesteps
2140:        for t, hframe in zip(self.timesteps, timeseries):

@MDAnalysis/coredevs do you have any opinion on moving timesteps to results.timesteps instead? If I'm understanding correctly, the right way to do that is to deprecate self.timesteps in favor of self.results.timesteps, while for now copying self.results.timesteps to timesteps.

marinegor avatar Nov 26 '25 23:11 marinegor

@talagayev

I'm slightly worried about these timesteps -- shouldn't they be actual timestamps, and not just consecutive integers?

I.e. if I modify lines 1954-ish it to:

        if self.results.network:
            result = []
            print(f'{self.timesteps=}')
            assert False
            for time, frame in zip(self.timesteps, self.results.network):

and run tests, I get [0.0] before assertion error.

I guess the issue here is that attribute timesteps are implicitly expected to exist after the run is finished, which is not the case when we run it in parallel. Hence they should be moved to self.results.timesteps if we rely on them in _conclude or elsewhere.

It's weirdly the only class that actually does this:

❯ rg -t py self.timesteps
package/MDAnalysis/analysis/hydrogenbonds/wbridge_analysis.py
1027:        self.timesteps = None  # time for each frame
1314:        self.timesteps = []
1409:        self.timesteps.append(self._ts.time)
1969:        if self.timesteps is None:
1972:            timesteps = self.timesteps
2036:            if self.timesteps is None:
2039:                timesteps = self.timesteps
2140:        for t, hframe in zip(self.timesteps, timeseries):

@MDAnalysis/coredevs do you have any opinion on moving timesteps to results.timesteps instead? If I'm understanding correctly, the right way to do that is to deprecate self.timesteps in favor of self.results.timesteps, while for now copying self.results.timesteps to timesteps.

True, yes was not sure about the solution, if it will work for other cases, that are not covered by the pytests 🙈 Yes should be timesteps, which is currently tricky to get during multiprocessing, which is as you mentioned that they have to exist after the run, which works well with serial, but difficult with multiprocessing.

hm, yes the moving to of timesteps to results.timesteps could be an option I think, would need to look more into it as well.

talagayev avatar Nov 27 '25 00:11 talagayev

I vaguely remember having this conversation already, actually.

I think #3720 and #3735 (reading changelog) implemented something like that, so we could in principle do this right away too🫣 But I need other coderev opinions.

marinegor avatar Nov 27 '25 00:11 marinegor

I vaguely remember having this conversation already, actually.

I think #3720 and #3735 (reading changelog) implemented something like that, so we could in principle do this right away too🫣 But I need other coderev opinions.

Yep, sounds good, let's see what the others say, in the meantime I can check the PR and how it was done there.

talagayev avatar Nov 27 '25 01:11 talagayev

timesteps is not part of the AnalysisBase API (as far as I can tell) so I don't see an issue putting it under results.

If it was documented as a top level attribute of the specific class before then you have to deprecate it and keep a pointer around until we can remove it in 3.0.

(Apologies, quick comment before I am now starting my 🦃 days.)

orbeckst avatar Nov 27 '25 01:11 orbeckst

also @talagayev you can have a look at InterRDS_s and DeprecationWarning there to have a look at how things get moved to results. I'd do something similar here as well.

marinegor avatar Nov 27 '25 21:11 marinegor

@talagayev why is StringIO not working here?

orbeckst avatar Dec 01 '25 16:12 orbeckst

I pasted traceback above -- seems that StringIO doesn't work well with multiprocessing.

marinegor avatar Dec 01 '25 17:12 marinegor

I pasted traceback above -- seems that StringIO doesn't work well with multiprocessing.

Could it be connected somehow to the case, that at some point in the parallelization if it uses ReaderBase to read the file it would try to read the file and would not be able to access it, since it gets only a string that is not pickable and iteratable due to this part of the code in the coordinates/base.py? Since here he only exception being NamedStream files.

    if isinstance(filename, NamedStream):
        self.filename = filename
    else:
        self.filename = str(filename)

maybe we can adjust it somehow for StringIO cases. How is it currently regulated with NamedStream cases? I assume those would possibly also fall into the same line of them being difficult to parallelize?

talagayev avatar Dec 01 '25 18:12 talagayev

@orbeckst @IAlibay do you have any feedback on moving StringIO-based tests for this PR?

I honestly see this as a separate issue, which we totally must raise (@talagayev please tell me if you wanna do that, otherwise I'll do it). But, I don't see why it would block this particular PR enabling parallelization.

@talagayev could you please implement all non-StringIO related changes requested by me and Irfan and re-request review when it's ready?

marinegor avatar Dec 14 '25 16:12 marinegor

Do you suggest to make the tests here run with file-based access and then revert to StringIO when it's fixed?

orbeckst avatar Dec 15 '25 15:12 orbeckst

I generally do not understand why particularly here tests are StringIO-based -- it doesn't seem to be that common afaik.

I'd suggest making them file-based (as the rest of the tests) and raising an issue about using StringIO topology/universe with non-serial backends.

marinegor avatar Dec 15 '25 16:12 marinegor

StringIO cuts down on the number of files to ship and it's neat when you only have mini-files that you can keep in-line.

orbeckst avatar Dec 15 '25 16:12 orbeckst

I agree, it reduces numbers of files indeed, and is more readable. But in this particular case I feel like passing tests for the class that actually should run on real-world data (=files) could have a higher priority. I'm not sure if StringIO is ever used in real-world scenarios, and it feels weird to me to block this PR because of the (inconsistent with other AnalysisBase subclasses) way tests are written here.

And indeed as you mentioned, we can indeed revert to StringIO when the issue with it in parallel context is resolved.

marinegor avatar Dec 15 '25 16:12 marinegor

@marinegor StringIO is very frequently used in the real world, please do not discount it as a second class citizen, otherwise a chunk of the ecosystem is going to fall apart.

IAlibay avatar Dec 15 '25 17:12 IAlibay

Sorry if I came across a bit too strong, I don't have problems with StringIO in general. But in this particular case, seems that

  • StringIO doesn't work for any non-serial run for any AnalysisBase subclass
  • this particular PR is correct, and fails because of the issue above

Hence I suggest not fixing StringIO issue here but rather raising it separately, while accepting this PR (and switching it to file-based testing).

marinegor avatar Dec 15 '25 17:12 marinegor

@orbeckst @IAlibay do you have any feedback on moving StringIO-based tests for this PR?

I honestly see this as a separate issue, which we totally must raise (@talagayev please tell me if you wanna do that, otherwise I'll do it). But, I don't see why it would block this particular PR enabling parallelization.

@talagayev could you please implement all non-StringIO related changes requested by me and Irfan and re-request review when it's ready?

Yup can do, but will quickly wait for an decision how we proceed with StringIO and can then adjust the code and create the Issue.

talagayev avatar Dec 15 '25 17:12 talagayev