mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

Add an `AnalysisCollection` class

Open PicoCentauri opened this issue 2 years ago • 19 comments

Fixes #3569

The AnalysisCollection follows the discussion of #3569 resetting the timestep object for each analysis by default. If requested the user can set the reset_timestep variable to False allowing altering the timestep object.

Within this PR I moved the run method from AnalysisBase to AnalysisCollection to avoid code a lot of duplication. These changes interface of Github might make this a bit difficult to follow. The core changes the run method only that instead of running the _prepare, _single_frame and _conclude methods only once it is looped over all provided analysis instances.

Changes made in this Pull Request:

  • Add an AnalysisCollection class to perform multiple analyses on the same trajectory
  • Use assert_equal in test_base.py

PR Checklist

  • [x] Tests?
  • [x] Docs?
  • [x] CHANGELOG updated?
  • [x] Issue raised/referenced?

Disclaimer: This PR was written with inspiration of OpenAI's ChatGPT.

PicoCentauri avatar Jan 30 '23 10:01 PicoCentauri

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 93.17%. Comparing base (347a0c0) to head (e762e6a).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4017      +/-   ##
===========================================
- Coverage    93.59%   93.17%   -0.43%     
===========================================
  Files          168       12     -156     
  Lines        21104     1069   -20035     
  Branches      3919        0    -3919     
===========================================
- Hits         19752      996   -18756     
+ Misses         894       73     -821     
+ Partials       458        0     -458     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jan 30 '23 10:01 codecov[bot]

If I use the code given in the example of class I got a speedup of ~30% compared to running each class individually. Quite good for such a simple change 🙂

PicoCentauri avatar Feb 16 '23 15:02 PicoCentauri

@richardjgowers I made you the responsible adult in the room ;-) — please shepherd the PR to completion

orbeckst avatar Feb 24 '23 22:02 orbeckst

Up for discussion, would it be a good idea to inherit abc.Collection class so that one can do e.g.

for analysis in collection:
    print(analysis.results)

Besides, it feels a bit weird to me that AnalysisBase is inherited from AnalysisCollection which is a collection of AnalysisBases...

yuxuanzhuang avatar Feb 25 '23 10:02 yuxuanzhuang

We could wither duplicate the logic, which I don't like. Or, create en external class that both the Base and the Collection use.

PicoCentauri avatar Feb 25 '23 18:02 PicoCentauri

We could wither duplicate the logic, which I don't like. Or, create en external class that both the Base and the Collection use.

Would it be a good idea to create an external Runner class for the run task and have e.g. SingleAnalysisRunner, CollectionAnalysisRunner (I'm not totally sure what the best design pattern is here) ? I'm also thinking about if we plan to implement parallel analysis based on it, it can be simply creating a new ParallelAnalysisRunner (or even ParallelCollectionAnalysisRunner) instead of doing complex inheritance.

yuxuanzhuang avatar Feb 27 '23 09:02 yuxuanzhuang

Would it be a good idea to create an external Runner class for the run task and have e.g. SingleAnalysisRunner, CollectionAnalysisRunner (I'm not totally sure what the best design pattern is here) ? I'm also thinking about if we plan to implement parallel analysis based on it, it can be simply creating a new ParallelAnalysisRunner (or even ParallelCollectionAnalysisRunner) instead of doing complex inheritance.

I am fine with creating these two classes. To avoid code duplication the SingleAnalysisRunner will inherit from the CollectionAnalysisRunner and just call the run method for a single instance.

Also, should they rather be private classes? I think users will not use them because these runners will require an already prepared analysis class.

PicoCentauri avatar Feb 27 '23 09:02 PicoCentauri

We could wither duplicate the logic, which I don't like. Or, create en external class that both the Base and the Collection use.

I think this solution is the cleanest and doesn't require even more classes which I don't like the idea of. IMO there is already enough of a barrier to entry without every new analysis having to handle 2x new API points.

Personally I would just have an abc abstact base class that they can both inherent from.

hmacdope avatar Mar 05 '23 02:03 hmacdope

Yes, we can do an ABC class if the others also agree.

PicoCentauri avatar Mar 05 '23 10:03 PicoCentauri

Thanks to all of you for your comments and suggestions and sorry for my later answer.

I agree that the current design is a bit confusing. I thought about the ABC abstract base class. I think this does not really fit here. The reason is that AnalysisCollection works with and on AnalysisBase instances.

A cleaner way for reading the code would be the idea that @yuxuanzhuang suggested that we provide a (private ?) _Runner class that basically is the code of the AnalysisCollection and the setup_frames. This _Runner will analyze one or more AnalysisBase instances. What do you think?

Regarding some questions and comments:

I think this solution is the cleanest and doesn't require even more classes which I don't like the idea of. IMO there is already enough of a barrier to entry without every new analysis having to handle 2x new API points.

This was never the idea. There will only be one API point for new Analysis classes.

@PicoCentauri have you thought about how results could be accessed from the AnalysisCollection object? Or is the intent to always access results from the individual classes again?

The latter. I think accessing the results from the individual instances is already handy. An additional way would be just confusing and blowing up the code.

PicoCentauri avatar Mar 23 '23 16:03 PicoCentauri

@orbeckst @hmacdope I would still like to get this feature into main. Before I rebase again I would like to ask for your opinion again. I think the biggest issue to be discussed is the inheritance of AnalysisBase and AnalysisCollection.

PicoCentauri avatar Aug 28 '23 11:08 PicoCentauri

@PicoCentauri sorry for the following short comment... I've run out of "open source time" for the day: There's a longish discussion on parallel analysis in #4158 and it would be good to hear your opinion how the proposed feature could/should work with an AnalysisBase that would try to parallelize with split-apply-combine (see PMDA https://doi.org/10.25080/Majora-7ddc1dd1-013 for the simple idea).

orbeckst avatar Aug 28 '23 21:08 orbeckst

Hi @PicoCentauri , thanks so much for the PR, it's a neat idea!

Regarding of being compatible with #4158, I feel like you'd need to change the _compute method of your AnalysisCollection class instead of run, and the rest will be more or less the same.

And I'd probably indeed subclass AnalysisBase in this case -- you'd need to re-implement __init__, _single_frame, _prepare and _conclude, and run will be then implemented automatically, with an option to be executed on multiple backends too.

marinegor avatar Aug 31 '23 16:08 marinegor

Thanks @orbeckst and @marinegor for the comments.

Regarding of being compatible with https://github.com/MDAnalysis/mdanalysis/issues/4158, I feel like you'd need to change the _compute method of your AnalysisCollection class instead of run, and the rest will be more or less the same.

Changing to _compute is an easy fix and I could do it..

And I'd probably indeed subclass AnalysisBase in this case -- you'd need to re-implement init, _single_frame, _prepare and _conclude, and run will be then implemented automatically, with an option to be executed on multiple backends too.

I am not sure If I get your comment correctly. You would subclass AnalysisBase ? But this means we don't have reimplement the methods. Maybe you meant to NOT subclass AnalysisBase?

PicoCentauri avatar Sep 22 '23 09:09 PicoCentauri

@PicoCentauri I indeed mean "subclass AnalysisBase" here, and re-implement the methods that need implementation in subclasses -- otherwise, you need to manually add run method to it, and make sure its semantics is maintained the same with AnalysisBase (I assume this was your intention).

But if you subclass AnalysisBase here, you only need to add _single_frame, _prepare and _conclude methods. _single_frame would run _single_frame of all analysis objects from the collection, _prepare would initialize their _prepare methods, and _conclude would aggregate the computed results back.

marinegor avatar Sep 22 '23 11:09 marinegor

Hello @PicoCentauri! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2024-06-05 16:10:06 UTC

pep8speaks avatar Jun 05 '24 15:06 pep8speaks

Linter Bot Results:

Hi @PicoCentauri! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/9387570012/job/25850803205


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

github-actions[bot] avatar Jun 05 '24 15:06 github-actions[bot]

Sorry for the lack of updates for a while. I addressed your concerns and moved the implementation of the run method into a standalone function that is used by the AnalysisBase as well as the AnalysisCollection.

PicoCentauri avatar Jun 05 '24 15:06 PicoCentauri

hey @PicoCentauri , the parallelization PR that we've mentioned earlier has been merged into develop (https://github.com/MDAnalysis/mdanalysis/pull/4162), and your AnalysisCollection would be extremely helpful here. Reason is, parallel analysis doesn't provide any real speedup unless processing of a single frame is noticeably slower that reading it from the disk -- but if we're running an AnalysisCollection, it's almost guaranteed!

Could you perhaps have a look at your code after you merge develop into it? I can't yet see how your approach would interact with the current parallelization protocol (you can look at the diagram here: https://docs.mdanalysis.org/dev/documentation_pages/analysis/parallelization.html#split-apply-combine), but it might actually work well.

If this doesn't work, perhaps I'd suggest implementing AnalysisCollection as AnalysisBase subclass, and in its _single_frame running each subclass's _single_frame() and manually updating the trajectory -- this way, you will likely get the parallelization with multiprocessing as soon as you figure out how to aggregate the results. But it's more tedious, and I'm happy to discuss it in more details if you want.

marinegor avatar Aug 17 '24 06:08 marinegor