
cnmf.fit: support npy/NWB/hdf5

Open · tbenst opened this issue 4 years ago · 14 comments

  • Describe the issue that you are experiencing

When calling cnmf.fit, this line loads the entire dataset into memory:

https://github.com/flatironinstitute/CaImAn/blob/89ea84845d56f75f29f411c53c046e7fa0c79e9a/caiman/source_extraction/cnmf/cnmf.py#L451

One other line below this would likely cause the same issue: https://github.com/flatironinstitute/CaImAn/blob/89ea84845d56f75f29f411c53c046e7fa0c79e9a/caiman/source_extraction/cnmf/cnmf.py#L464

After that, there is mem-management logic: https://github.com/flatironinstitute/CaImAn/blob/89ea84845d56f75f29f411c53c046e7fa0c79e9a/caiman/source_extraction/cnmf/cnmf.py#L488

But at least some functions after that still look likely to raise errors; e.g., get_noise_fft is likely to use too much memory.

tbenst avatar May 08 '20 07:05 tbenst

<shameless plug> If you need to sub-select or transpose h5py Dataset objects without reading the whole thing, lazy_ops might be useful </shameless plug>
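For concreteness, a minimal sketch of the lazy_ops pattern described above, based on its README as I recall it; the names DatasetView, lazy_slice, lazy_transpose and dsetread, as well as the file/key names, are assumptions that may differ by version:

```python
import h5py
from lazy_ops import DatasetView  # API names assumed from the lazy_ops README

with h5py.File("movie.h5", "r") as f:                      # hypothetical file/key
    view = DatasetView(f["mov"])                           # wraps the h5py Dataset; no data read yet
    lazy = view.lazy_transpose().lazy_slice[:, 0:512, 0:512]
    block = lazy.dsetread()                                 # only the selected block is read here
```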

bendichter avatar May 08 '20 12:05 bendichter

@tbenst The fit method is called on a memory-mapped file, which is not fully loaded into memory. Is there a specific error you get?

epnev avatar May 09 '20 13:05 epnev

Hi @epnev, there is no error; caiman uses 120+ GiB of RAM and is killed by the kernel when swap is depleted.

I'm calling fit on an HDF5 dataset, which is indeed a memory-mapped file. But as I indicated above, slicing or transposing an h5py dataset involves reading it into RAM.

tbenst avatar May 09 '20 17:05 tbenst

@tbenst The problem is that the code does not support calling fit directly on an HDF5 file (you can only do it if you don't process in patches, which is not advisable). It has to be a numpy mmap file; otherwise memory blows up. For mmap files, transposing and slicing do not have this problem. I tried in the past to fix this and allow slicing of non-memory-mapped files, but couldn't get it done.

Introducing lazy_ops or zarr arrays could be a solution here, but personally I don't have the bandwidth to do it. I'm obviously open to it if anybody wants to give it a try.

Another thing that falls into the same category (a useful feature, but I personally don't have the bandwidth) is adapting the online pipeline to 3D datasets. This is definitely the most scalable solution and most likely does not require a huge amount of work.

epnev avatar May 09 '20 18:05 epnev

Thanks for the thoughtful explanation. I'm definitely sympathetic that solving this the right™ way may require a lot of dev time without much functional benefit. Perhaps a reasonable compromise would be to convert the HDF5 file to a temporary mmap file and delete it afterwards?

tbenst avatar May 10 '20 01:05 tbenst

Ah, I just realized that the fit code doesn't support NWB/HDF5 at all. This is a bit unfortunate given that motion correction supports NWB/HDF5. It also seems that most of the fit code works with NWB/HDF5, as I got much further this time with a smaller file. Changing the issue title.

File "/home/tyler/code/babelfish/scripts/nwb/caiman", line 197, in main
    cnm = cnm.fit(images)
  File "/nix/store/smhqcp13f8887ajmj2ysjgcczh38lpjw-python3-3.7.6-env/lib/python3.7/site-packages/caiman/source_extraction/cnmf/cnmf.py", line 601, in fit
    images.filename, self.dims + (T,), self.params,
AttributeError: 'Dataset' object has no attribute 'filename'

Edit: I read more of the code. It seems that the filepath gets passed around rather than the array, and the dimensions are also embedded in the filepath.

Would you be open to refactoring this code to make it friendlier for HDF5? I may be able to help, but I'd need guidance, as the current implementation bakes in assumptions about the backend.

Memmaps can be read/written in parallel, and numpy supports storing dimensions in memmapped .npy files via numpy.lib.format.open_memmap. With these two changes, there shouldn't be a need to pass the filepath around, and as a bonus it's easier for users without caiman installed to read the results.
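A minimal sketch of the open_memmap idea (file name and shape are hypothetical): the .npy header stores shape, dtype and order, so nothing needs to be encoded in the filename, and the result is readable with plain numpy:

```python
import numpy as np

# Create a memory-mapped .npy file; shape/dtype/order live in the .npy header,
# so nothing has to be encoded in the filename. (Name/shape are hypothetical.)
Y = np.lib.format.open_memmap("movie.npy", mode="w+",
                              dtype=np.float32, shape=(1000, 256, 256))
Y[0] = 0.0          # written through to disk, frame by frame
Y.flush()

# Any process with numpy (no caiman needed) can reopen it without knowing the dims:
Y2 = np.load("movie.npy", mmap_mode="r")
print(Y2.shape, Y2.dtype)
```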

It looks like the transpose operation is needed due to the decision to switch between C and Fortran ordering. I think this operation can be eliminated by changing the ordering of the memmap file to match what numpy uses in the rest of the code.

Please let me know where I may be missing things, as I know there are often complex reasons for design decisions.

tbenst avatar May 10 '20 02:05 tbenst

> Ah, I just realized that the fit code doesn't support NWB/HDF5 at all. This is a bit unfortunate given that motion correction supports NWB/HDF5. It also seems that most of the fit code works with NWB/HDF5, as I got much further this time with a smaller file. Changing the issue title.

The motion correction step supports multiple file formats, but it always generates a mmap file that is then used by the fit method. If you don't want to do motion correction, you can always save your file as a mmap file and then pass it to the fit function. See lines 121-122 here.
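For reference, a rough sketch of that save-as-mmap route, following the pattern of the CaImAn demo scripts (the HDF5 file name is hypothetical, and keyword arguments may differ slightly between CaImAn versions):

```python
import numpy as np
import caiman as cm
from caiman.source_extraction.cnmf import cnmf as cnmf_mod
from caiman.source_extraction.cnmf import params as params_mod

# Convert the HDF5 movie to CaImAn's C-order mmap format first.
fname_new = cm.save_memmap(["movie.hdf5"], base_name="memmap_", order="C")

# Reload it as a memmap (not into RAM) and reshape to (T,) + dims.
Yr, dims, T = cm.load_memmap(fname_new)
images = np.reshape(Yr.T, [T] + list(dims), order="F")

opts = params_mod.CNMFParams(params_dict={"fnames": [fname_new]})
cnm = cnmf_mod.CNMF(n_processes=1, params=opts)
cnm = cnm.fit(images)
```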

File "/home/tyler/code/babelfish/scripts/nwb/caiman", line 197, in main
    cnm = cnm.fit(images)
  File "/nix/store/smhqcp13f8887ajmj2ysjgcczh38lpjw-python3-3.7.6-env/lib/python3.7/site-packages/caiman/source_extraction/cnmf/cnmf.py", line 601, in fit
    images.filename, self.dims + (T,), self.params,
AttributeError: 'Dataset' object has no attribute 'filename'

> Edit: I read more of the code. It seems that the filepath gets passed around rather than the array, and the dimensions are also embedded in the filepath.
>
> Would you be open to refactoring this code to make it friendlier for HDF5? I may be able to help, but I'd need guidance, as the current implementation bakes in assumptions about the backend.

Yes, that's pretty much what I faced too. While you can modify the code in most places, at some point you end up loading multiple copies of the dataset, which causes problems. The most promising approach to me is to use the shared-memory techniques that Python 3.8 enables. See also the related discussion in #368 (or invest in a good online implementation).
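To make the shared-memory direction concrete, a minimal sketch with Python 3.8's multiprocessing.shared_memory (the data shape and worker function are hypothetical); workers attach to one named buffer instead of each receiving its own copy of the movie:

```python
import numpy as np
from multiprocessing import shared_memory

# Parent: place the movie in a named shared-memory block once (hypothetical data).
movie = np.zeros((100, 128, 128), dtype=np.float32)
shm = shared_memory.SharedMemory(create=True, size=movie.nbytes)
buf = np.ndarray(movie.shape, dtype=movie.dtype, buffer=shm.buf)
buf[:] = movie[:]

def worker(name, shape, dtype):
    # Worker: attach by name; no copy of the data is made.
    existing = shared_memory.SharedMemory(name=name)
    Y = np.ndarray(shape, dtype=dtype, buffer=existing.buf)
    trace = Y[:, 64, 64].mean()          # e.g. operate on one pixel's trace
    existing.close()
    return trace

print(worker(shm.name, movie.shape, movie.dtype))
shm.close()
shm.unlink()
```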

> Memmaps can be read/written in parallel, and numpy supports storing dimensions in memmapped .npy files via numpy.lib.format.open_memmap. With these two changes, there shouldn't be a need to pass the filepath around, and as a bonus it's easier for users without caiman installed to read the results.

Yes, we utilize this behavior in the memmap files we create. Are you saying there is a different type of numpy memory-mapped file that allows the same for HDF5 files as well?

> It looks like the transpose operation is needed due to the decision to switch between C and Fortran ordering. I think this operation can be eliminated by changing the ordering of the memmap file to match what numpy uses in the rest of the code.
>
> Please let me know where I may be missing things, as I know there are often complex reasons for design decisions.

The transpose operation on mmap files is more or less virtual; nothing changes under the hood. However, the ordering is important, and unfortunately you have to switch to 'C' order, otherwise accessing a pixel's values across time would take forever. Copying from our CaImAn paper:

In the context of calcium imaging datasets, CaImAn batch represents the datasets in a matrix form Y, where each row corresponds to a different imaged pixel, and each column to a different frame. As a result, a column-major order mmap file enables the fast access of individual frames at a given time, whereas a row-major order file enables the fast access of an individual pixel at all times. To facilitate processing in patches CaImAn batch stores the data in row-major order. In practice, this is opposite to the order with which the data appears, one frame at a time. In order to reduce memory usage and speed up computation CaImAn batch employs a MapReduce approach, where either multiple files or multiple chunks of a big file composing the original datasets are processed and saved in mmap format in parallel. This operation includes two phases: first the chunks/files are saved (possibly after motion correction, if required) in multiple row-major mmap files, and then the chunks are simultaneously combined into a single large row-major mmap file.
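As a plain-numpy illustration of the ordering point (file names and shapes are hypothetical): in the pixels × frames matrix described above, 'C' (row-major) order keeps each pixel's time series contiguous on disk, while 'F' (column-major) order keeps each frame contiguous:

```python
import numpy as np

n_pixels, n_frames = 128 * 128, 1000      # hypothetical dimensions

# Row-major ('C'): each pixel's full time series is contiguous on disk,
# so Y_c[i, :] (one pixel over time) is a single sequential read.
Y_c = np.memmap("Y_c.mmap", dtype=np.float32, mode="w+",
                shape=(n_pixels, n_frames), order="C")

# Column-major ('F'): each frame is contiguous instead, so Y_f[:, t]
# (one frame) is fast, but Y_f[i, :] strides across the entire file.
Y_f = np.memmap("Y_f.mmap", dtype=np.float32, mode="w+",
                shape=(n_pixels, n_frames), order="F")

pixel_trace = np.array(Y_c[100, :])   # cheap under 'C' order
frame       = np.array(Y_f[:, 100])   # cheap under 'F' order
```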

More modern frameworks (e.g. zarr, lazy ops, or dask) might be able to mitigate some of this pain but we haven't looked into that in detail.

epnev avatar May 10 '20 14:05 epnev

@tbenst Unfortunately I cannot put any significant effort into this (right now, let's say, supporting/developing CaImAn is a pro bono activity for me). However, I'm pinging the other developers to see if there is any interest:

@agiovann @j-friedrich @pgunn Please see the discussion above. As 3D datasets become more ubiquitous we start hitting memory/space limits. Would you be willing to look into ways to increase our throughput? Possible directions could include:

  1. Use shared memory with Python 3.8
  2. Use something like zarr, lazy ops or dask
  3. Adapt the online processing pipeline.

epnev avatar May 10 '20 14:05 epnev

Just to add a little more context: lazy_ops would be useful if you want to transpose or slice an HDF5 dataset before reading the data. It looks like you are also interested in having multiple processes read and write an HDF5 file in parallel. lazy_ops won't help you with this per se, but HDF5 does have tools to allow multiple processes to read/write in parallel. See h5py's documentation on Parallel HDF5. We support this in NWB, with a proof-of-concept here. From personal experience, this is possible, though non-trivial, to set up. It might require custom compilations of libraries. To help with this, we have created Docker and Singularity containers that take care of the tough steps. Let me know if you go down this route and I might be able to help.
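For completeness, a bare-bones sketch of the parallel-HDF5 route mentioned above (this assumes h5py built against a parallel HDF5 plus mpi4py installed; the file and dataset names are hypothetical):

```python
# Run with e.g.: mpiexec -n 4 python parallel_write.py
# Requires h5py built against parallel HDF5, plus mpi4py; names are hypothetical.
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

with h5py.File("movie_out.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("mov", shape=(size, 512, 512), dtype="float32")
    # Each rank writes its own slab of the dataset concurrently.
    dset[rank] = np.full((512, 512), rank, dtype="float32")
```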

bendichter avatar May 10 '20 15:05 bendichter

@j-friedrich Let's chat about this next week?

pgunn avatar May 15 '20 01:05 pgunn

We just met about this and are chewing on ideas for tackling these problems longer-term.

pgunn avatar May 20 '20 18:05 pgunn

> @agiovann @j-friedrich @pgunn Please see the discussion above. As 3D datasets become more ubiquitous we start hitting memory/space limits. Would you be willing to look into ways to increase our throughput? Possible directions could include:
>
>   1. Use shared memory with Python 3.8
>   2. Use something like zarr, lazy ops or dask
>   3. Adapt the online processing pipeline.

For now, in the short term, we are focusing on option 3, as there are good online algorithms (OnACID, OnACID-E), and we plan to push/optimize them as much as possible to see how well this option works moving forward.
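For readers landing here, a rough sketch of the online route, following the pattern of the OnACID demo scripts (the file name is hypothetical, only a small subset of parameters is shown, and exact parameter names may vary by CaImAn version):

```python
from caiman.source_extraction.cnmf.online_cnmf import OnACID
from caiman.source_extraction.cnmf.params import CNMFParams

params_dict = {
    "fnames": ["movie.hdf5"],   # hypothetical file; frames are streamed, not preloaded
    "fr": 15,                   # imaging rate (Hz)
    "decay_time": 0.5,          # typical transient decay time (s)
    "motion_correct": True,     # correct motion on the fly
}
opts = CNMFParams(params_dict=params_dict)

cnm = OnACID(params=opts)
cnm.fit_online()                # processes the movie one frame at a time
```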

EricThomson avatar May 01 '23 18:05 EricThomson

Just a thought on (1): the shared-memory interface is pretty low-level, and I would rather use zarr. Also, I'm not sure that (1) would solve the memory-size issues.
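As a concrete illustration of the zarr direction (store path, shape and chunking are hypothetical):

```python
import numpy as np
import zarr
import dask.array as da

# Hypothetical store, shape and chunking: one frame per chunk, so frames can be
# written and read independently (and in parallel) without loading the movie.
z = zarr.open("movie.zarr", mode="w", shape=(1000, 256, 256),
              chunks=(1, 256, 256), dtype="float32")
z[0] = np.zeros((256, 256), dtype="float32")    # touches only one chunk on disk

# Downstream code can operate lazily/out-of-core via dask.
movie = da.from_zarr("movie.zarr")
mean_img = movie.mean(axis=0).compute()
```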

Anyway, it seems like Johannes has written (3).

kushalkolar avatar May 02 '23 23:05 kushalkolar

Yes, I should have explicitly mentioned that @j-friedrich has really pushed option 3 forward with CNMF-E (there are now three notebooks for online CNMF-E, in addition to OnACID-CNMF). At this point it will be interesting to see how far they can be pushed toward their potential, as @kushalkolar has been doing recently, especially using fastplotlib as a visualization tool.

EricThomson avatar May 03 '23 02:05 EricThomson