NiMARE

Update `(M)KDAKernel` to do naive looping over studies when number of studies is large

Open · tyarkoni opened this issue · 2 comments

Summary

In #386, I modified the (M)KDA map generation process to do multiple studies in parallel. This produced a significant speed gain (> 30%), but I didn't consider the memory impact: when the number of studies is very large, memory explodes, as we internally construct a 4d image (number of studies x spatial dimensions).

We don't want to give up the speed gains, because in most cases memory won't be an issue. But in the specific case of, e.g., setting up the entire Neurosynth dataset, the kernel transformation step becomes impractical, and the user is forced to fall back to lowmem, which will probably be very slow. The proposed solution is to add a check on the number of studies: if it's above some conservative value we can experiment with (say, 500 or 1000), naively loop over the studies being passed to `compute_kda_ma` from `MKDAKernel._fit()`, instead of passing them all in one shot. Alternatively, we could add a flag to the estimator, but I suspect people would miss it, so it's probably better to make the decision for them implicitly.
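
A back-of-the-envelope sketch of where the memory goes (the numbers assume the 2 mm MNI152 grid of 91 × 109 × 91 voxels and an int64 kernel value; the 2050-study figure matches the traceback reported later in this thread):

```python
# Rough arithmetic for the internally constructed 4D array.
n_studies = 2050                       # e.g., a Neurosynth-scale dataset
voxels = 91 * 109 * 91                 # ~903k voxels per study map
bytes_needed = n_studies * voxels * 8  # int64 -> 8 bytes per element
print(bytes_needed / 2**30)            # ~13.8 GiB
```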

Additional details

We get the best of both worlds: increased speed under normal conditions, and acceptable memory consumption at the cost of slower processing when the dataset is particularly large. (We could even try to be clever and do something in between, by passing batches of studies, but this feels unnecessary.)
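
For the "in between" idea, a rough sketch of what passing batches of studies might look like; the helper and its arguments are illustrative stand-ins, not NiMARE functions:

```python
import numpy as np

def iter_ma_batches(study_ids, make_ma_maps, batch_size=500):
    """Hypothetical sketch: yield MA maps a few hundred studies at a time so
    the caller can reduce each batch (e.g., accumulate a sum) without ever
    holding the full (n_studies, x, y, z) array in memory.

    `make_ma_maps` stands in for the real per-batch map generation
    (compute_kda_ma in NiMARE); it is a parameter here only for illustration.
    """
    study_ids = np.asarray(study_ids)
    n_batches = max(1, int(np.ceil(len(study_ids) / batch_size)))
    for batch in np.array_split(study_ids, n_batches):
        # Each call only allocates (len(batch), x, y, z).
        yield make_ma_maps(batch)
```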

Next steps

This block probably just needs to be wrapped in a loop that only kicks in if the number of studies is above some threshold.
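
A minimal sketch of that dispatch, assuming a hard-coded study-count threshold and placeholder function names (not the actual `compute_kda_ma` signature):

```python
STUDY_THRESHOLD = 500  # conservative cutoff; to be tuned empirically

def generate_ma_maps(study_ids, make_maps_vectorized, make_map_single):
    """Hypothetical dispatch: keep the fast vectorized path for typical
    datasets, but fall back to a naive per-study loop when the 4D array
    would be too large to allocate comfortably."""
    if len(study_ids) <= STUDY_THRESHOLD:
        # Fast path: one (n_studies, x, y, z) array, as introduced in #386.
        return make_maps_vectorized(study_ids)
    # Slow path: one study at a time, so peak memory stays at one 3D map.
    return [make_map_single(study_id) for study_id in study_ids]
```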

tyarkoni commented Apr 08 '21 15:04

I finally stumbled across this issue today. Here's what the traceback looks like:

Traceback (most recent call last):
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/base.py", line 314, in fit
    maps = self._fit(dataset)
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/cbma/base.py", line 78, in _fit
    ma_values = self._collect_ma_maps(
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/cbma/base.py", line 176, in _collect_ma_maps
    ma_maps = self.kernel_transformer.transform(
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/kernel.py", line 200, in transform
    transformed_maps = self._transform(mask, coordinates)
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/kernel.py", line 403, in _transform
    transformed = compute_kda_ma(
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/utils.py", line 347, in compute_kda_ma
    kernel_data = np.zeros(kernel_shape, dtype=type(value))
numpy.core._exceptions.MemoryError: Unable to allocate 13.8 GiB for an array with shape (2050, 91, 109, 91) and data type int64

tsalo commented Feb 08 '22 22:02

`memory_limit` should address the problem, so I think the question is whether the kernel should have a limit that is independent of the `memory_limit` parameter, or whether `memory_limit` should be set to some non-None value by default.
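
One illustrative way to frame the difference, assuming a memory budget expressed in bytes (a pure hypothetical, not the NiMARE API): with a non-None default, the chunk size could be derived from the budget itself rather than from a kernel-specific study-count cutoff.

```python
import math

def studies_per_batch(memory_limit_bytes, mask_shape=(91, 109, 91), itemsize=8):
    """Hypothetical helper (not the NiMARE API): how many study maps fit in a
    given memory budget. A non-None default memory_limit could drive the
    loop/vectorize decision via a value like this, instead of a separate,
    kernel-specific study-count threshold."""
    per_study_bytes = math.prod(mask_shape) * itemsize  # one 3D MA map
    return max(1, memory_limit_bytes // per_study_bytes)

# e.g., a 4 GiB budget allows ~594 studies per chunk on the 2 mm MNI152 grid
print(studies_per_batch(4 * 2**30))
```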

tsalo commented Feb 15 '22 19:02