NiMARE
Update `(M)KDAKernel` to do naive looping over studies when number of studies is large
Summary
In #386, I modified the (M)KDA map generation process to handle multiple studies in parallel. This produced a significant speed gain (> 30%), but I didn't consider the memory impact: when the number of studies is very large, memory explodes, because we internally construct a 4D array (number of studies x the three spatial dimensions).
We don't want to give up the speed gains, because in most cases memory won't be an issue. But in the specific case of, e.g., setting up the entire Neurosynth dataset, the kernel transformation step becomes impractical, and the user is forced to fall back to the low-memory (lowmem) option, which will probably be very slow. The proposed solution is to add a check on the number of studies and, if it is above some conservative value we can experiment with (say, 500 or 1000), to naively loop over the studies passed to compute_kda_ma from MKDAKernel._fit(), instead of passing them all in one shot. Alternatively, we could add a flag to the estimator, but I suspect people would miss it, so it's probably better to make the decision for them implicitly.
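As a rough illustration, here is a minimal sketch of that dispatch. It uses a toy stand-in for compute_kda_ma (the real function lives in nimare.meta.utils and has a different signature); collect_ma_values, compute_kda_ma_stub, and MAX_VECTORIZED_STUDIES are hypothetical names, not NiMARE API:

```python
import numpy as np

# Toy stand-in for nimare.meta.utils.compute_kda_ma: given per-study foci as
# {study_id: (n_foci, 3) ijk array}, it returns a dense 4D array of shape
# (n_studies, *shape) with 1s at the focus voxels. The real function also
# dilates each focus by the kernel radius and has a different signature.
def compute_kda_ma_stub(shape, foci_by_study):
    out = np.zeros((len(foci_by_study), *shape), dtype=np.int64)
    for row, ijk in enumerate(foci_by_study.values()):
        out[row, ijk[:, 0], ijk[:, 1], ijk[:, 2]] = 1
    return out


MAX_VECTORIZED_STUDIES = 1000  # conservative cutoff to experiment with


def collect_ma_values(shape, foci_by_study, mask):
    """Dispatch between the fast vectorized path and a naive per-study loop."""
    n_studies = len(foci_by_study)

    if n_studies <= MAX_VECTORIZED_STUDIES:
        # Normal case: one vectorized call, one big 4D allocation (fast).
        return compute_kda_ma_stub(shape, foci_by_study)[:, mask]

    # Very large datasets (e.g., the full Neurosynth dataset): call the kernel
    # one study at a time and keep only the in-mask voxels, so a dense
    # (n_studies, *shape) array is never allocated.
    ma_values = np.zeros((n_studies, int(mask.sum())), dtype=np.int8)
    for row, (study_id, ijk) in enumerate(foci_by_study.items()):
        ma_values[row] = compute_kda_ma_stub(shape, {study_id: ijk})[0][mask]
    return ma_values


# Tiny usage example: 5 studies with 3 foci each on a fake 10x10x10 grid.
rng = np.random.default_rng(0)
foci = {f"study{i}": rng.integers(0, 10, size=(3, 3)) for i in range(5)}
mask = np.ones((10, 10, 10), dtype=bool)
print(collect_ma_values((10, 10, 10), foci, mask).shape)  # (5, 1000)
```

Masking each per-study map before stacking is what keeps the looped branch from recreating the same dense 4D allocation it is meant to avoid.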
Additional details
We get the best of both worlds: increased speed under normal conditions, and acceptable memory consumption at the cost of slower processing when the dataset is particularly large. (We could even try to be clever and do something in between, by passing batches of studies, but this feels unnecessary.)
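For completeness, a sketch of that batched middle ground; iter_batches is an illustrative helper, not NiMARE API, and the idea is only that the temporary 4D array would be bounded by batch_size rather than by the total number of studies:

```python
def iter_batches(study_ids, batch_size=500):
    """Yield successive slices of at most ``batch_size`` study IDs."""
    for start in range(0, len(study_ids), batch_size):
        yield study_ids[start:start + batch_size]


# For the 2050-study case in the traceback below, this yields chunks of
# 500, 500, 500, 500, and 50 studies.
print([len(chunk) for chunk in iter_batches(list(range(2050)))])
```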
Next steps
This block probably just needs to be wrapped in a loop that only kicks in if the number of studies is above some threshold.
I finally stumbled across this issue today. Here's what the traceback looks like:
Traceback (most recent call last):
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/base.py", line 314, in fit
    maps = self._fit(dataset)
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/cbma/base.py", line 78, in _fit
    ma_values = self._collect_ma_maps(
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/cbma/base.py", line 176, in _collect_ma_maps
    ma_maps = self.kernel_transformer.transform(
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/kernel.py", line 200, in transform
    transformed_maps = self._transform(mask, coordinates)
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/kernel.py", line 403, in _transform
    transformed = compute_kda_ma(
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/utils.py", line 347, in compute_kda_ma
    kernel_data = np.zeros(kernel_shape, dtype=type(value))
numpy.core._exceptions.MemoryError: Unable to allocate 13.8 GiB for an array with shape (2050, 91, 109, 91) and data type int64
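For what it's worth, the reported size is exactly what a dense int64 array of that shape costs:

```python
import numpy as np

# Sanity check on the figure in the error message: an int64 array of shape
# (2050, 91, 109, 91) needs about 13.8 GiB.
n_bytes = 2050 * 91 * 109 * 91 * np.dtype(np.int64).itemsize
print(f"{n_bytes / 2**30:.1f} GiB")  # 13.8 GiB
```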
memory_limit should address the problem, so I think the question is whether the kernel should have a limit that is independent of the memory_limit parameter, or if memory_limit should be set to some non-None value by default.
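Either way, the kernel needs a rule for turning a memory budget into a study cutoff. A hedged sketch of that conversion, assuming the budget is expressed in bytes (the real memory_limit parameter may use a different convention); max_studies_for_budget is a hypothetical helper, not NiMARE API:

```python
import numpy as np

def max_studies_for_budget(budget_bytes, mask_shape=(91, 109, 91), itemsize=8):
    """Largest number of studies whose dense 4D MA array fits in ``budget_bytes``.

    Illustrative only: assumes the kernel allocates one ``itemsize``-byte value
    per voxel per study, as in the traceback above.
    """
    bytes_per_study = int(np.prod(mask_shape)) * itemsize
    return budget_bytes // bytes_per_study


# With a hypothetical 4 GiB default, roughly 594 studies fit in one vectorized
# call; larger datasets would fall back to the looped/batched path.
print(max_studies_for_budget(4 * 2**30))
```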