Refactor PairwiseEstimators to implement caching (only use one dataset instead of two)
Summary
I would find it easier to cache the MA maps if the `Dataset` object remained consistent across calls to `meta_estimator.fit`, instead of shuffling the studies between two datasets. Perhaps a separate argument, or a column in one of the dataset dataframes, could indicate the grouping?
Additional details
There would be one call to `_collect_ma_maps` (instead of two), where all the MA maps would be loaded during the first run and the result cached for future calls. Currently this would be hard, since the two datasets change which studies are in each dataset:
https://github.com/neurostuff/NiMARE/blob/d0935dae85394c7f45f1e9a8e84845ffba270c47/nimare/meta/cbma/mkda.py#L205-L214
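For concreteness, here's a minimal sketch of the kind of caching a stable `Dataset` would enable. The cache attribute and the `_compute_ma_map` helper are hypothetical names for this sketch; only the `_collect_ma_maps` name comes from the existing code:

```python
# Sketch only. `_ma_cache` and `_compute_ma_map` are hypothetical; the point is
# that a stable Dataset lets MA maps be keyed by study ID and reused.


class CachedMAMixin:
    """Illustrative per-study caching of modeled activation (MA) maps."""

    def __init__(self):
        self._ma_cache = {}  # study ID -> MA map array

    def _collect_ma_maps(self, dataset, ids):
        """Return MA maps for ``ids``, computing only the ones not yet cached."""
        for id_ in ids:
            if id_ not in self._ma_cache:
                # Expensive step (kernel convolution / image loading), done at
                # most once per study across repeated fit() calls.
                self._ma_cache[id_] = self._compute_ma_map(dataset, id_)
        return [self._ma_cache[id_] for id_ in ids]

    def _compute_ma_map(self, dataset, id_):
        raise NotImplementedError  # stand-in for the real MA-map computation
```

Splitting the studies into two groups would then just index into the same cache, rather than recomputing maps per group.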
Next steps
- Refactor pairwise estimators to use one dataset.
- Cache the result of getting all the MA maps.
@tsalo @tyarkoni: thoughts?
This sounds like what we had a while back, wherein you would feed in `Dataset` to `Estimator.__init__()` and a list of ids to `Estimator.fit()`. We moved away from that approach in #155, based on changes proposed in #115. The discussion in #115 is pretty dense, so I'll need to look through it again to refresh my memory.
I think ultimately the right solution to this problem is to represent the data in a more structured way that closely aligns with the NIMADS spec. I.e., `Dataset` would be a lightweight container of `Study` objects; `Study` objects would contain `Image` objects, and so on. If we added a caching layer on the intensive operations (e.g., transforming nifti volumes into in-memory masked arrays), then reshaping those into pandas DataFrames or numpy arrays on the fly would be pretty cheap. Note that in that case, so long as new `Dataset` instances were created from old ones, references to existing `Study` or `Image` objects would be maintained, so there would be much less copying/slicing, etc.

I think that's the cleanest way to do it, because if the caching is bound to images, then we automatically get the benefits anywhere in the package that (masked) images are used, and don't have to think about it in specific analyses, at the `Dataset` level, etc.
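As a rough illustration of what image-bound caching might look like (the class names mirror the structure described above, none of this is NiMARE's current API, and `nilearn.masking.apply_mask` is just one way to do the masking):

```python
# Sketch only: NIMADS-like containers with caching bound to the Image level.
from dataclasses import dataclass, field

from nilearn.masking import apply_mask


@dataclass
class Image:
    path: str
    _masked: object = field(default=None, repr=False)

    def get_masked_data(self, mask_img):
        """Mask the volume once; reuse the in-memory array on later calls.

        A real cache would also key on the mask instead of storing one result.
        """
        if self._masked is None:
            self._masked = apply_mask(self.path, mask_img)  # the expensive step
        return self._masked


@dataclass
class Study:
    id: str
    images: list


@dataclass
class Dataset:
    studies: list

    def slice(self, ids):
        """New Dataset that keeps references to the same Study objects,
        so any arrays already cached on their Images stay shared."""
        return Dataset([study for study in self.studies if study.id in ids])
```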
Oh, but re: having a single `Dataset` instead of two: I think if we want to do it that way, we would probably need to introduce a new abstraction and conceptually separate "dataset" in the sense of "arbitrary collection of studies amenable to meta-analysis" from "dataset" in the sense of "a particular assignment of coordinates or images to one or more conditions used in a specific meta-analysis". This is already where we're planning to go with the front-end suite, so it would be very reasonable to mirror it in the back-end code.
The core idea, then, is that we introduce something like a `CodedData` object (that's a crappy name, but just to make the point) that references a `Dataset` and maps some of its contents to coded variables. Then, if there's only one coded variable/group, estimators like ALE and MKDA will just assume that that variable is the one you want to meta-analyze. But in principle you could have arbitrarily many, and when you pass the `CodedData` to a pairwise estimator, you need to name the two conditions that you're contrasting as arguments.
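A very rough sketch of that idea (the name `CodedData` is the placeholder from above; the fields, methods, and the `group1`/`group2` fit arguments are all hypothetical):

```python
# Sketch only: a placeholder CodedData that maps named conditions to study IDs.


class CodedData:
    """Binds a Dataset to a coding of its studies into named conditions."""

    def __init__(self, dataset, conditions):
        # `conditions` maps a condition name to the study IDs coded with it,
        # e.g. {"task_A": [...], "task_B": [...]}.
        self.dataset = dataset
        self.conditions = conditions

    def ids_for(self, condition):
        """Study IDs coded with ``condition``."""
        return self.conditions[condition]


# Hypothetical usage with a pairwise estimator: the two contrasted conditions
# are named explicitly at fit time, e.g.
#   coded = CodedData(dset, {"task_A": ids_a, "task_B": ids_b})
#   MKDAChi2().fit(coded, group1="task_A", group2="task_B")
```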
To clarify, this sort of thing has always been the long-term plan; it's the reason I didn't want to build a lot of dataset-munging logic into NiMARE, because ideally all of the data representation (and probably also the caching) should be handled by a NIMADS/NeuroStore client library. NiMARE would then only have to worry about the part of this that maps data directly into estimator inputs (as in the `CodedData` abstraction suggested above). This might be the right time to shift gears to building that library, rather than adding more functionality to NiMARE that will probably only end up moving out later.
Had a call with @jdkent and @nicoalee today about annotations and this came up. I think I will have to break this down into at least three sets of changes:
- Superficial changes that rely on `Dataset.slice`. Basically, I will add `group1_ids` and `group2_ids` parameters to the estimators, and drop the `dset2` parameter. Internally, the first thing the estimator would do is just split the dataset into two using the ID arguments, so most of the actual code will be the same, while the API will match what you're talking about (see the sketch after this list).
- The deeper changes to optimize the code for the new API. I don't have time to tackle this at the moment, but we will want to do it at some point.
- Refactoring `Dataset`s to match NIMADS and implementing the `CodedData` class. This is going to be a much larger undertaking.
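To make the first bullet concrete, here's roughly what the slice-based entry point could look like. The `group1_ids`/`group2_ids` names come from the comment above; the internals are assumptions, with `Dataset.slice(ids)` being the existing method that returns a reduced `Dataset`:

```python
# Sketch of the "superficial" change: one Dataset in, split internally by ID.


class PairwiseCBMAEstimator:
    def fit(self, dataset, group1_ids=None, group2_ids=None):
        """Fit from a single Dataset plus two groups of study IDs."""
        # Dataset.slice(ids) returns a reduced Dataset; the existing
        # two-dataset code path can then be reused unchanged.
        dset1 = dataset.slice(group1_ids)
        dset2 = dataset.slice(group2_ids)
        return self._fit(dset1, dset2)

    def _fit(self, dset1, dset2):
        raise NotImplementedError  # existing estimator-specific logic
```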