mdanalysis
mdanalysis copied to clipboard
Add an `AnalysisCollection` class
Fixes #3569
The AnalysisCollection follows the discussion of #3569 resetting the timestep object for each analysis by default. If requested the user can set the reset_timestep variable to False allowing altering the timestep object.
Within this PR I moved the run method from AnalysisBase to AnalysisCollection to avoid code a lot of duplication. These changes interface of Github might make this a bit difficult to follow. The core changes the run method only that instead of running the _prepare, _single_frame and _conclude methods only once it is looped over all provided analysis instances.
Changes made in this Pull Request:
- Add an
AnalysisCollectionclass to perform multiple analyses on the same trajectory - Use
assert_equalintest_base.py
PR Checklist
- [x] Tests?
- [x] Docs?
- [x] CHANGELOG updated?
- [x] Issue raised/referenced?
Disclaimer: This PR was written with inspiration of OpenAI's ChatGPT.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 93.17%. Comparing base (
347a0c0) to head (e762e6a).
Additional details and impacted files
@@ Coverage Diff @@
## develop #4017 +/- ##
===========================================
- Coverage 93.59% 93.17% -0.43%
===========================================
Files 168 12 -156
Lines 21104 1069 -20035
Branches 3919 0 -3919
===========================================
- Hits 19752 996 -18756
+ Misses 894 73 -821
+ Partials 458 0 -458
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
If I use the code given in the example of class I got a speedup of ~30% compared to running each class individually. Quite good for such a simple change 🙂
@richardjgowers I made you the responsible adult in the room ;-) — please shepherd the PR to completion
Up for discussion, would it be a good idea to inherit abc.Collection class so that one can do e.g.
for analysis in collection:
print(analysis.results)
Besides, it feels a bit weird to me that AnalysisBase is inherited from AnalysisCollection which is a collection of AnalysisBases...
We could wither duplicate the logic, which I don't like. Or, create en external class that both the Base and the Collection use.
We could wither duplicate the logic, which I don't like. Or, create en external class that both the Base and the Collection use.
Would it be a good idea to create an external Runner class for the run task and have e.g. SingleAnalysisRunner, CollectionAnalysisRunner (I'm not totally sure what the best design pattern is here) ? I'm also thinking about if we plan to implement parallel analysis based on it, it can be simply creating a new ParallelAnalysisRunner (or even ParallelCollectionAnalysisRunner) instead of doing complex inheritance.
Would it be a good idea to create an external Runner class for the run task and have e.g. SingleAnalysisRunner, CollectionAnalysisRunner (I'm not totally sure what the best design pattern is here) ? I'm also thinking about if we plan to implement parallel analysis based on it, it can be simply creating a new ParallelAnalysisRunner (or even ParallelCollectionAnalysisRunner) instead of doing complex inheritance.
I am fine with creating these two classes. To avoid code duplication the SingleAnalysisRunner will inherit from the CollectionAnalysisRunner and just call the run method for a single instance.
Also, should they rather be private classes? I think users will not use them because these runners will require an already prepared analysis class.
We could wither duplicate the logic, which I don't like. Or, create en external class that both the Base and the Collection use.
I think this solution is the cleanest and doesn't require even more classes which I don't like the idea of. IMO there is already enough of a barrier to entry without every new analysis having to handle 2x new API points.
Personally I would just have an abc abstact base class that they can both inherent from.
Yes, we can do an ABC class if the others also agree.
Thanks to all of you for your comments and suggestions and sorry for my later answer.
I agree that the current design is a bit confusing. I thought about the ABC abstract base class. I think this does not really fit here. The reason is that AnalysisCollection works with and on AnalysisBase instances.
A cleaner way for reading the code would be the idea that @yuxuanzhuang suggested that we provide a (private ?) _Runner class that basically is the code of the AnalysisCollection and the setup_frames. This _Runner will analyze one or more AnalysisBase instances. What do you think?
Regarding some questions and comments:
I think this solution is the cleanest and doesn't require even more classes which I don't like the idea of. IMO there is already enough of a barrier to entry without every new analysis having to handle 2x new API points.
This was never the idea. There will only be one API point for new Analysis classes.
@PicoCentauri have you thought about how results could be accessed from the AnalysisCollection object? Or is the intent to always access results from the individual classes again?
The latter. I think accessing the results from the individual instances is already handy. An additional way would be just confusing and blowing up the code.
@orbeckst @hmacdope I would still like to get this feature into main. Before I rebase again I would like to ask for your opinion again. I think the biggest issue to be discussed is the inheritance of AnalysisBase and AnalysisCollection.
@PicoCentauri sorry for the following short comment... I've run out of "open source time" for the day: There's a longish discussion on parallel analysis in #4158 and it would be good to hear your opinion how the proposed feature could/should work with an AnalysisBase that would try to parallelize with split-apply-combine (see PMDA https://doi.org/10.25080/Majora-7ddc1dd1-013 for the simple idea).
Hi @PicoCentauri , thanks so much for the PR, it's a neat idea!
Regarding of being compatible with #4158, I feel like you'd need to change the _compute method of your AnalysisCollection class instead of run, and the rest will be more or less the same.
And I'd probably indeed subclass AnalysisBase in this case -- you'd need to re-implement __init__, _single_frame, _prepare and _conclude, and run will be then implemented automatically, with an option to be executed on multiple backends too.
Thanks @orbeckst and @marinegor for the comments.
Regarding of being compatible with https://github.com/MDAnalysis/mdanalysis/issues/4158, I feel like you'd need to change the _compute method of your AnalysisCollection class instead of run, and the rest will be more or less the same.
Changing to _compute is an easy fix and I could do it..
And I'd probably indeed subclass AnalysisBase in this case -- you'd need to re-implement init, _single_frame, _prepare and _conclude, and run will be then implemented automatically, with an option to be executed on multiple backends too.
I am not sure If I get your comment correctly. You would subclass AnalysisBase ? But this means we don't have reimplement the methods. Maybe you meant to NOT subclass AnalysisBase?
@PicoCentauri I indeed mean "subclass AnalysisBase" here, and re-implement the methods that need implementation in subclasses -- otherwise, you need to manually add run method to it, and make sure its semantics is maintained the same with AnalysisBase (I assume this was your intention).
But if you subclass AnalysisBase here, you only need to add _single_frame, _prepare and _conclude methods.
_single_frame would run _single_frame of all analysis objects from the collection, _prepare would initialize their _prepare methods, and _conclude would aggregate the computed results back.
Hello @PicoCentauri! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Comment last updated at 2024-06-05 16:10:06 UTC
Linter Bot Results:
Hi @PicoCentauri! Thanks for making this PR. We linted your code and found the following:
Some issues were found with the formatting of your code.
| Code Location | Outcome |
|---|---|
| main package | ⚠️ Possible failure |
| testsuite | ⚠️ Possible failure |
Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/9387570012/job/25850803205
Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!
Sorry for the lack of updates for a while. I addressed your concerns and moved the implementation of the run method into a standalone function that is used by the AnalysisBase as well as the AnalysisCollection.
hey @PicoCentauri , the parallelization PR that we've mentioned earlier has been merged into develop (https://github.com/MDAnalysis/mdanalysis/pull/4162), and your AnalysisCollection would be extremely helpful here. Reason is, parallel analysis doesn't provide any real speedup unless processing of a single frame is noticeably slower that reading it from the disk -- but if we're running an AnalysisCollection, it's almost guaranteed!
Could you perhaps have a look at your code after you merge develop into it? I can't yet see how your approach would interact with the current parallelization protocol (you can look at the diagram here: https://docs.mdanalysis.org/dev/documentation_pages/analysis/parallelization.html#split-apply-combine), but it might actually work well.
If this doesn't work, perhaps I'd suggest implementing AnalysisCollection as AnalysisBase subclass, and in its _single_frame running each subclass's _single_frame() and manually updating the trajectory -- this way, you will likely get the parallelization with multiprocessing as soon as you figure out how to aggregate the results. But it's more tedious, and I'm happy to discuss it in more details if you want.