suite2p icon indicating copy to clipboard operation
suite2p copied to clipboard

POC: append output to input NWB file

Open chrisroat opened this issue 4 years ago • 5 comments

This is a proof of concept. It changes the behavior of the NWB output -- if the input is also NWB, the data is appended to the original file.

Output still written in suite2p native format, and then written again to NWB.

TODO: commit data file

chrisroat avatar Nov 04 '21 03:11 chrisroat

@bendichter @carsen-stringer

This is a proof of concept to see if it's a way to integrate nwb better with suite2p. (The test I included passes locally, but fails in the CI because I have not included the test file via dvc).

This change allows the user to write the suite2p output directly back to the nwb file that had the input data.

Is this a recommended approach for NWB use? My preference would be to write a separate file, but then references have to be maintained across files and that ended up getting a bit janky (though better than the fake data that Suite2p created, which I left in for the case of non-NWB input). For example, DANDI required a Subject ID when I wrote a separate file and that wasn't straightforward to work with.

The change keeps the pattern of "write out data in native format" then "rewrite data to NWB". A future commit (possibly part of this PR) could remove that extra I/O.

chrisroat avatar Nov 17 '21 19:11 chrisroat

Instead of this approach, the discussion on @bendichter's PR to this branch has made me rethink how to proceed.

Instead of adding to the input file, we can do a shallow copy of the input file to a new file (which will have links to data in the input file) and put the Suite2p output into this new file.

This would require that using NWB output format requires NWB as the input format. I think this is a worthwhile approach because otherwise, a lot of metadata must be faked or input by the user. It also encourages upstream data to be in NWB format. This is a change in behavior, and I wanted to be sure @carsen-stringer is OK with it.

Also of note - there is a known deficiency in DANDI with regard to the links between files, so the input/output file pair cannot be used in DANDI. As a workaround, a user could do a deep copy of the output file to create a single DANDI-compatible file that will contain both the input and output data.

chrisroat avatar Nov 22 '21 15:11 chrisroat

@chrisroat I want to make sure this feature is not reduced to a mode that only works for a subset of users. I would prefer it to remain possible to save processed data in a different file from the source data. I agree that faking data in the processed file is bad, but I would prefer to solve that problem by allowing the user to specify such data on save. For users that do not want to copy this type of metadata, it should also be possible to save the processed data to the original file. Your proposed solution also has the problem that it is not currently compatible with DANDI.

bendichter avatar Nov 22 '21 15:11 bendichter

I understand the desire to have many people be able to use this, but I worry it won't be sustainable to have each tool author maintain all of these different options. The boilerplate I'm seeing is making me think about alternative approaches.

In my personal experience, separating the I/O and the tool into layers is better than having the tool understand and be tested against every combination of I/O.

In a sense, I think of tool authors writing things like pandas or numpy functions, which are very focused and use simple data types, and shouldn't have to worry about the on-disk formats, external metadata structure, provenance, etc. With this sort of separation, tools can adapt to NWB updates in a much cleaner way.

We could leave the tool to operate directly on the optimal format (i.e. a numpy array). There would be an I/O translator layer with two implementations -- one that works with the current data formats, and one that works with NWB. The NWB wrapper could be maintained separately, and the other wrapper could be deprecated over a release cycle or two.

chrisroat avatar Nov 25 '21 04:11 chrisroat

@chrisroat that sounds a lot like what we have here. Example usage:

from roiextractors import NwbSegmentationExtractor, Suite2pSegmentationExtractor

seg_ex = Suite2pSegmentationExtractor("segmentation_datasets/suite2p")
NwbSegmentationExtractor.write_segmentation(seg_ex, "output_path.nwb")

This takes as input the standard suite2p output and can save to either an existing or new file and allows you to input the metadata dictionary if you want to.

bendichter avatar Nov 25 '21 20:11 bendichter

@chrisroat is correct, we can barely maintain the code we write ourselves. Thanks for the PR though, sorry we couldn't take it.

marius10p avatar Oct 30 '23 15:10 marius10p