mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

Preserve Full Trajectory Information in Parallel Analysis

Open yuxuanzhuang opened this issue 10 months ago • 5 comments

Is your feature request related to a problem?

In analyses e.g. DistanceMatrix in diffusionmap, we need to run analysis of a single frame over the full trajectory---that is sliced and defined by start, stop, and step in run(). However, in the current parallel analysis implementation, information about the full trajectory is lost (only per-process slices are visible).

Describe the solution you’d like

Firstly, in _setup_frames (which only runs in the main process), store the sliced trajectory information in self._global_slicer. This makes it possible to retrieve global details such as the total number of frames.

Secondly, in self._compute, also store self._global_frame_index so the analysis can track the global frame index for each frame, rather than just per-process slices.

The following code will illustrate its usage

import MDAnalysis as mda 
from MDAnalysisTests.datafiles import PDB, XTC
import numpy as np
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.results import ResultsGroup

u = mda.Universe(PDB, XTC)

class MyAnalysis(AnalysisBase):
    _analysis_algorithm_is_parallelizable = True

    @classmethod
    def get_supported_backends(cls):
        return ('serial', 'multiprocessing', 'dask',)

    def _prepare(self):
       self.results.frame_index = []
       self.results.global_frame_index = []
       self.results.n_frames = []
       self.results.global_n_frames = []

    def _single_frame(self):
       frame_index = self._frame_index
       global_frame_index = self._global_frame_index

       self.results.frame_index.append(frame_index)
       self.results.global_frame_index.append(global_frame_index)

       self.results.n_frames.append(self.n_frames)

       global_n_frames = len(self._trajectory[self._global_slicer])
       self.results.global_n_frames.append(global_n_frames)

    def _get_aggregator(self):
      return ResultsGroup(lookup={'frame_index': ResultsGroup.flatten_sequence,
                                  'global_frame_index': ResultsGroup.flatten_sequence,
                                  'n_frames': ResultsGroup.flatten_sequence,
                                  'global_n_frames': ResultsGroup.flatten_sequence})

# serial analysis
ana = MyAnalysis(u.trajectory)
ana.run(step=2)
ana.results
> {'frame_index': [0, 1, 2, 3, 4], 'global_frame_index': [0, 1, 2, 3, 4], 'n_frames': [5, 5, 5, 5, 5], 'global_n_frames': [5, 5, 5, 5, 5]}

# parallel analysis
ana = MyAnalysis(u.trajectory)
ana.run(step=2, backend='dask', n_workers=2)
ana.results
> {'frame_index': [0, 1, 2, 0, 1], 'global_frame_index': [0, 1, 2, 3, 4], 'n_frames': [3, 3, 3, 2, 2], 'global_n_frames': [5, 5, 5, 5, 5]}

n_frames and frame_index are not the same for serial and parallel analysis

Related PR

https://github.com/MDAnalysis/mdanalysis/pull/4745

yuxuanzhuang avatar Jan 13 '25 23:01 yuxuanzhuang

@yuxuanzhuang just wanted to say that I saw the PR and plan to review it tomorrow or on Saturday.

as for the _global part, I can think of two solutions:

  • have them implemented as properties with some documentation (instead of pure attributes), since as I understand, the current solution adds attributes and doesn't really allow to e.g. read the documentaion on them directly (instead you have to go read the actual docs, which is one step further away from jupyter's run.attribute?)
  • add an attribute constants or something, that would keep track of all readonly properties that the reader assigns during frame reading, and list all of them there

The last one is a bigger change though, but it also seems that there are a lot of different properties already, that might be worth unifying under a common class.

marinegor avatar Jan 15 '25 18:01 marinegor

I like the idea of the global constants (or other name... "COMMON", anyone ;-) ), similar to results, as a name space for global variables. At least then it looks less messy.

orbeckst avatar Feb 05 '25 01:02 orbeckst

Hi @yuxuanzhuang,

my apologies for delaying the review. I have looked at the PR and decided to take the discussion here.

First, I completely understand the rationale behind this, and I agree that it must be implemented. However, there are indeed some issues with the naming, but perhaps generally with how AnalysisBase separates run constants from all other attributes that might come along (it really doesn't, only with _).

Ideally (and also recalling my GSOC days), I'd love to have all the current-timestamp-related attributes in one namespace (either as a nested attribute, or simply with the same prefix), but that is too much to ask, and will break stuff. But we still can do it here -- the _global prefix is actually a good step in that direction.

My problem with _global was that at first, I didn't really understand, if this index/slicer is global (i.e. "for everything"), what is this "everything". Since it represents the parameters of the whole run, I'd simply call it _run_slicer and _run_frame_index, because that's what it's defined by -- the run parameters.

A sub-suggestion is to have something like _computational_group_frame_index and _computational_group_slicer, though that's way too wordy and also duplicates existing attributes. But if you have some suggestions here, would be double cool.

But, I'd approach the documentation from that run/computational-group standpoint: there are two groups of parameters, whole-run-related and computational-group-related, and I'd explicitly list them as such in the docs.


So action points:

  • rename _global_... to _run_...
  • add documentation section explicitly listing computational group attributes (all old ones) vs whole run attributes (_run_... ones)
  • long-term: think about how implement this run/computation-group namespaces

For the last one, perhaps making indeed self.constants.{run,computational_group} namespaces would be ok, and then the computational_group ones would link to existing attributes, while run would be new. But generally I'm ok with having things as they are.

marinegor avatar Feb 19 '25 23:02 marinegor

Oh, and the other thing -- if the sole purpose of that is to allow people who write custom parallelization code to iterate through the trajectory independently of the current slicer, why don't you write it as a custom method, instead of making them do self._trajectory[self._global_slicer]? That could be slightly cleaner, I guess, and also would have documentation, whereas the indexing of trajectory paradigm would be less discoverable in the docs.

marinegor avatar Feb 20 '25 09:02 marinegor

I implemented two dataclass in #4892 to store run-related information.

@dataclass(frozen=True)
class RunConfig:
    """Stores user-provided arguments for `run()`."""
    start: Optional[int] = None
    stop: Optional[int] = None
    step: Optional[int] = None
    frames: Optional[np.ndarray] = None
    backend: Optional[Union[str, BackendBase]] = None
    n_workers: Optional[int] = None
    n_parts: Optional[int] = None
    unsupported_backend: bool = False

@dataclass
class RunState:
    """Stores runtime-generated attributes that can be used
    during the analysis."""
    slicer: Optional[Union[slice, np.ndarray]] = None
    n_frames: Optional[int] = None
    computation_groups: Optional[List[np.ndarray]] = None
    frame_index: Optional[int] = None

This allows each computational group to access these constant parameters and runtime variables more intuitively. Would it be clearer?

I think it is also possible to create a ComputationalGroupState (or Constant? I am bad at naming :) ) to store per-group information or at least we should document that the current self.n_frames, self._frame_index stores information of one computational group.

Additionally, it provides the benefit of improved debugging, as these details are now retained instead of being lost.

yuxuanzhuang avatar Mar 18 '25 18:03 yuxuanzhuang