
Refactor PairwiseEstimators to implement caching (only use one dataset instead of two)


Summary

I would find it easier to cache the ma_maps if the Dataset object remained consistent across calls to meta_estimator.fit, instead of shuffling studies between two datasets. Perhaps a separate argument, or a column in one of the Dataset dataframes, could indicate the grouping?
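
For concreteness, a minimal sketch of what the proposed call might look like (group1_ids/group2_ids are hypothetical arguments, not the current API; today's pairwise estimators take two datasets):

```python
from nimare.meta.cbma.mkda import MKDAChi2

# Hypothetical single-dataset API: the grouping arguments do not exist yet.
# Because `dataset` stays the same object across calls, MA maps computed on
# the first fit() could be reused by later fits.
meta = MKDAChi2()
res_ab = meta.fit(dataset, group1_ids=ids_a, group2_ids=ids_b)
res_ac = meta.fit(dataset, group1_ids=ids_a, group2_ids=ids_c)  # cache hit
```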

Additional details

There would be one call to _collect_ma_maps (instead of two): all of the ma_maps would be loaded during the first run, and the result could be cached for future calls. Currently this is hard because the two datasets change which studies each one contains: https://github.com/neurostuff/NiMARE/blob/d0935dae85394c7f45f1e9a8e84845ffba270c47/nimare/meta/cbma/mkda.py#L205-L214
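
As a self-contained toy illustration of the caching pattern (the real _collect_ma_maps has a different signature; the random array below just stands in for the expensive kernel-convolution step):

```python
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=None)
def collect_ma_maps(study_ids):
    """Stand-in for the expensive MA-map step, keyed on a tuple of IDs."""
    print(f"computing MA maps for {len(study_ids)} studies")
    return np.random.default_rng(0).random((len(study_ids), 1000))

all_ids = tuple(sorted(f"study-{i:03d}" for i in range(20)))
maps = collect_ma_maps(all_ids)   # computed on the first run
maps = collect_ma_maps(all_ids)   # cache hit: nothing recomputed
```

This only pays off if the same set of studies is requested across calls, which is exactly what keeping a single, stable Dataset would guarantee.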

Next steps

  • Refactor the pairwise estimators to use one dataset.
  • Cache the result of collecting all of the MA maps.

@tsalo @tyarkoni: thoughts?

jdkent avatar Apr 08 '21 22:04 jdkent

This sounds like what we had a while back, wherein you would feed a Dataset to Estimator.__init__() and a list of IDs to Estimator.fit(). We moved away from that approach in #155, based on changes proposed in #115. The discussion in #115 is pretty dense, so I'll need to look through it again to refresh my memory.

tsalo avatar Apr 09 '21 02:04 tsalo

I think ultimately the right solution to this problem is to represent the data in a more structured way that closely aligns with the NIMADS spec: Dataset would be a lightweight container holding Study objects; Study objects would contain Image objects; and so on. If we added a caching layer on the intensive operations (e.g., transforming nifti volumes into in-memory masked arrays), then reshaping those into pandas DFs or numpy arrays on the fly would be pretty cheap. Note that in that case, as long as new Dataset instances were created from old ones, references to existing Study and Image objects would be maintained, so there would be much less copying/slicing, etc.

I think that's the cleanest way to do it, because if the caching is bound to images, then we automatically get the benefits anywhere in the package that (masked) images are used, and we don't have to think about it in specific analyses, at the Dataset level, etc.
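
A rough sketch of that structure, purely illustrative (the names are not taken from the NIMADS spec, and `masker.transform` stands in for the nifti-to-masked-array step):

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np

@dataclass
class Image:
    path: str
    _masked: Optional[np.ndarray] = field(default=None, repr=False)

    def get_masked(self, masker) -> np.ndarray:
        # The expensive step (nifti -> in-memory masked array) runs once
        # per Image; every Dataset referencing this Image shares the result.
        if self._masked is None:
            self._masked = masker.transform(self.path)
        return self._masked

@dataclass
class Study:
    id: str
    images: List[Image]

@dataclass
class Dataset:
    studies: List[Study]

    def slice(self, ids):
        # New Dataset instances reuse the same Study/Image objects, so the
        # per-Image cache survives slicing with no copying.
        keep = set(ids)
        return Dataset([s for s in self.studies if s.id in keep])
```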

tyarkoni avatar Apr 09 '21 14:04 tyarkoni

Oh, but re: having a single Dataset instead of two: I think if we want to do it that way, we would probably need to introduce a new abstraction and conceptually separate dataset in the sense of "arbitrary collection of studies amenable to meta-analysis" from dataset in the sense of "a particular assignment of coordinates or images to one or more conditions used in a specific meta-analysis". This is already where we're planning to go with the front-end suite, so it would be very reasonable to mirror it in the back-end code.

The core idea then is that we introduce something like a CodedData object (that's a crappy name, but just to make the point) that references a Dataset and maps some of its contents to coded variables. Then, if there's only one coded variable/group, estimators like ALE and MKDA will just assume that that variable is the one you want to meta-analyze. But in principle you could have arbitrarily many, and when you pass the CodedData to a pairwise estimator, you need to name the two conditions that you're contrasting as arguments.
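
Purely to make the shape concrete, a sketch of that abstraction (CodedData and its fields are placeholders, as noted above):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CodedData:
    """References a Dataset and maps some of its contents to coded variables."""
    dataset: object                # the stable, shared Dataset
    codes: Dict[str, List[str]]    # coded variable/condition -> study IDs

    def ids(self, condition: str) -> List[str]:
        return self.codes[condition]

# With one condition, ALE/MKDA would assume it is the analysis target; a
# pairwise estimator would name the two conditions to contrast, e.g.:
#   meta.fit(coded, condition1="pain", condition2="touch")  # hypothetical
coded = CodedData(dataset=None, codes={"pain": ["s1", "s2"], "touch": ["s3"]})
print(coded.ids("pain"))
```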

tyarkoni avatar Apr 09 '21 14:04 tyarkoni

To clarify, this sort of thing has always been the long-term plan; it's the reason I didn't want to build a lot of dataset-munging logic into NiMARE, because ideally all of the data representation (and probably also caching) should be handled by a NIMADS/NeuroStore client library. NiMARE would then only have to worry about the part of this that maps data directly into estimator inputs (as in the CodedData abstraction suggested above). This might be the right time to shift gears to building that library, rather than adding more functionality to NiMARE that will probably only end up moving out later.

tyarkoni avatar Apr 09 '21 15:04 tyarkoni

Had a call with @jdkent and @nicoalee today about annotations and this came up. I think I will have to break this down into at least three sets of changes:

  1. Superficial changes that rely on Dataset.slice. Basically, I will add group1_ids and group2_ids parameters to the estimators and drop the dset2 parameter. Internally, the first thing the estimator will do is split the dataset into two using the ID arguments, so most of the actual code will stay the same while the API matches what you're describing (see the sketch after this list).
  2. The deeper changes to optimize the code for the new API. I don't have time to tackle this at the moment, but we will want to do it at some point.
  3. Refactoring Datasets to match NIMADS and implementing the CodedData class. This is going to be a much larger undertaking.
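
A sketch of what step 1 could look like (parameter names from above; treating _fit as the existing two-dataset internals is an assumption about the current code):

```python
class PairwiseCBMAEstimator:
    def fit(self, dataset, group1_ids=None, group2_ids=None):
        # Dataset.slice(ids) returns a new Dataset restricted to those IDs,
        # so the existing two-dataset internals can stay untouched for now.
        dset1 = dataset.slice(group1_ids)
        dset2 = dataset.slice(group2_ids)
        return self._fit(dset1, dset2)  # assumed existing internal method
```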

tsalo avatar Dec 10 '21 17:12 tsalo