
Linking Subject from one nwb file to another

chrisroat opened this issue 4 years ago • 11 comments (status: Open)

Description

I have two NWB files, the first one with a Subject and the second one without a Subject. I'd like to copy the Subject from the first file to the second file, possibly using an external link. The example below attempts to use a BuildManager, but it is not successful. What is the correct way to do this?

Steps to Reproduce

from datetime import datetime
from dateutil.tz import tzlocal

from pynwb import NWBFile, NWBHDF5IO
from pynwb.file import Subject
from pynwb import get_manager

manager = get_manager()

with NWBHDF5IO('file1.nwb', 'w') as io1:
    nwbfile1 = NWBFile(
        session_description='file 1',
        identifier='file-1',
        session_start_time=datetime.now(tzlocal()),
        subject=Subject(subject_id="8675309"),
    )
    io1.write(nwbfile1)

with NWBHDF5IO('file2.nwb', 'w') as io2:
    nwbfile2 = NWBFile(
        session_description='file 2',
        identifier='file-2',
        session_start_time=datetime.now(tzlocal()),
    )
    io2.write(nwbfile2)

with NWBHDF5IO('file1.nwb', 'r', manager=manager) as io1r, NWBHDF5IO('file2.nwb', 'a', manager=manager) as io2a:
    nwbfile1r = io1r.read()
    nwbfile2a = io2a.read()
    nwbfile2a.subject = nwbfile1r.subject
    io2a.write(nwbfile2a)
   
with NWBHDF5IO('file2.nwb', 'r') as io2r:
    nwbfile2r = io2r.read()
    assert nwbfile2r.subject is not None, 'No subject in file2'

Environment

Python Executable: Python
Python Version: 3.8
Operating System: Linux
HDMF Version: 3.1.1
PyNWB Version: 2.0.0

Checklist

  • [x] Have you ensured the bug was not already reported?
  • [x] Have you included a brief and descriptive title?
  • [x] Have you included a clear description of the problem you are trying to solve?
  • [x] Have you included a minimal code snippet that reproduces the issue you are encountering?
  • [x] Have you checked our Contributing document?

chrisroat · Oct 08 '21

Hi @chrisroat

Just so I am clear, why would you prefer to use an external link here rather than just storing the subject in two different places?

Here's code that does what you are trying to do:

from datetime import datetime
from dateutil.tz import tzlocal

from pynwb import NWBFile, NWBHDF5IO
from pynwb.file import Subject
from pynwb import get_manager


with NWBHDF5IO('file1.nwb', 'w') as io1:
    nwbfile1 = NWBFile(
        session_description='file 1',
        identifier='file-1',
        session_start_time=datetime.now(tzlocal()),
        subject=Subject(subject_id="8675309"),
    )
    io1.write(nwbfile1)    
    
manager = get_manager()


with NWBHDF5IO('file1.nwb', 'r', manager=manager) as io1r:
    nwbfile1r = io1r.read()
    
    nwbfile2 = NWBFile(
        session_description='file 2',
        identifier='file-2',
        session_start_time=datetime.now(tzlocal()),
        subject=nwbfile1r.subject,
    )

    with NWBHDF5IO('file2.nwb', 'w', manager=manager) as io2a:
        io2a.write(nwbfile2)
    
with NWBHDF5IO('file2.nwb', 'r') as io2r:
    nwbfile2r = io2r.read()
    print(nwbfile2r.subject)

bendichter · Oct 12 '21

I would want an external link so that if I modified data about my Subject, any file referencing it would automatically pick up those details. (Imagine if 1000s of files referenced this subject.)
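
For context, here is a minimal sketch of what such a link could look like at the raw HDF5 level, using h5py rather than the pynwb API. The /general/subject path is an assumption based on the standard NWB layout, and pynwb or DANDI may not handle a link created this way gracefully:

import h5py

# Sketch only: make /general/subject in file2.nwb resolve to /general/subject
# in file1.nwb via an HDF5 external link. This bypasses pynwb entirely.
with h5py.File('file2.nwb', 'a') as f2:
    if 'general/subject' in f2:
        del f2['general/subject']  # drop any locally stored subject group first
    f2['general/subject'] = h5py.ExternalLink('file1.nwb', '/general/subject')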

Thanks for your example of creating a new file with an existing Subject. In my use case, though, I have two existing files and am trying to copy the Subject from one to the other (an append operation). This is needed right now with Suite2p, which does not copy over the Subject (or the imaging plane -- though it creates a fake imaging plane... aargh). The need arises because to upload a file to DANDI, one needs a subject_id.

The right approach is to update Suite2p to do this itself (their NWB support exists but is minimal), but my intermediate hack is to do the following. I will do something similar for the imaging plane, now that I realize it is incorrect. https://github.com/deisseroth-lab/data-sharing-examples/blob/54f6ebb4ca9c52506a4cca968920b7e9412a3100/dandi/ophys/run_suite2p.py#L28-L35
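
Roughly, the hack amounts to something like this (a sketch only, not the exact code at the link above; the file names are made up, and the append pattern may need adjustment across pynwb versions):

from pynwb import NWBHDF5IO
from pynwb.file import Subject

# Read the subject_id from the original acquisition file.
with NWBHDF5IO('file1.nwb', 'r') as src_io:
    subject_id = src_io.read().subject.subject_id

# Append a freshly constructed Subject to the Suite2p output file
# (copied by value, not linked), then write the additions back.
with NWBHDF5IO('suite2p_output.nwb', 'a') as dst_io:
    nwbfile = dst_io.read()
    nwbfile.subject = Subject(subject_id=subject_id)
    dst_io.write(nwbfile)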

chrisroat · Oct 12 '21

Hey @chrisroat! The way I'm trying to handle this for our lab's use is to have a subject metadata file that contains what we need for the experiment to run properly and, at the end of the experiment, to write out a "base" NWB file that has the subject information built into it, along with any other metadata I can throw in there at the end of a given session. It builds out NWB imaging planes depending on what you imaged (i.e., multiple z locations) and the number of channels you recorded from (i.e., red/green).

Do you all think that having a subject metadata file (or, even better, a database) is a way to avoid having to copy over subject information like this, or should I prepare to also use what you've made here for when we're at the point of sharing our data?

jmdelahanty · Oct 20 '21

The DANDI archive requires a subject_id, so if you are planning to upload to DANDI you will need to include it. It would be fine to have that information come from a metadata file or from a database. It sounds like you will have a "base" NWB file with all the necessary information, and that is what you'd read into Suite2p.

What I'm proposing is on the other side of things: when Suite2p writes an output NWB file, it should, imho, copy over the metadata -- subject, imaging plane, etc. -- from the "base" (input) NWB file.

chrisroat · Oct 20 '21

Ah I see, got it. I do believe that's something the lab is looking to do once we have our experiments finished up.

What I'm proposing is on the other side of things: when Suite2p writes an output NWB file, it should, imho, copy over the metadata -- subject, imaging plane, etc. -- from the "base" (input) NWB file.

I see, that makes sense. It definitely should do that; that's what this issue is about, correct? It would be really cool to try and help work on that if you'd have me! I'm in the middle of getting stimulations going on our Bruker scope and I'm super close to getting your container working on our Docker machines, so I'll be getting into Suite2p soon!

jmdelahanty · Oct 20 '21

We should probably keep discussions on the pertinent issues.

As far as this issue goes, we are still awaiting feedback. @bendichter, is the use case above not expected -- where an entity like a Subject needs to be duplicated across NWB files? Or should the upstream code modify the original NWB file instead? (I think this could work, as long as multiple processes aren't trying to modify the same HDF5 file at once.)

chrisroat · Oct 27 '21

What do you mean by "upstream"?

bendichter · Oct 27 '21

CatalystNeuro has a way of adding metadata when converting from the suite2p default output format:

from roiextractors import NwbSegmentationExtractor, Suite2pSegmentationExtractor

metadata = dict(Subject=dict(subject_id='001'))

seg_ex = Suite2pSegmentationExtractor("segmentation_datasets/suite2p")
NwbSegmentationExtractor.write_segmentation(seg_ex, "output_path.nwb", metadata=metadata)

So you would write in the default output format and then convert to NWB. That function should allow you to create a new NWB file or append to an existing file. It's not as smooth as outputting to NWB directly, but it should work. I agree it would be better if suite2p had a way of adding metadata while writing the file, and better yet, also pulling the metadata from the source file, but I also think there might be a limit to how much we can expect suite2p to build around NWB.

I would suggest duplicating the subject information, even if it's 1000s of files. You can try the external-link approach, but be warned that how DANDI will treat external links is uncharted territory.

bendichter · Oct 27 '21

It's too bad that CatalystNeuro rolls their own solutions, especially one that is simply reading an old format and writing a new one. This might be an indication of a deeper problem here. It's also a pity that updating 1000s of files seems like a plausible solution to a problem. I'd rather we figure out a way to update Suite2p so everyone can benefit, unless it seems way too difficult.

What do you mean by "upstream"?

I mean the input NWB file to Suite2p. Might a better solution be to write/append the output data directly to the incoming NWB file? This feels like it would work for many use cases. Is it in the spirit of how NWB is imagined?

We can start with the write output / read output / rewrite output approach, and then try to write directly as a second step.
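
For illustration, here is a minimal sketch of that direct-append idea, using a placeholder TimeSeries where the real Suite2p segmentation containers would go (the file name and contents are assumptions, not actual Suite2p code):

import numpy as np
from pynwb import NWBHDF5IO, TimeSeries

# Open the incoming NWB file in append mode, add a processing module holding
# the (placeholder) output, and write the additions back to the same file.
with NWBHDF5IO('session.nwb', 'a') as io:
    nwbfile = io.read()
    ophys = nwbfile.create_processing_module(
        name='ophys', description='placeholder for Suite2p output')
    ophys.add(TimeSeries(name='placeholder_trace',
                         data=np.zeros(10), unit='n.a.', rate=30.0))
    io.write(nwbfile)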

chrisroat · Oct 27 '21

OK well these are IMO two separate issues.

suite2p conversion: We try to develop reusable NWB conversion tools for data formats, and several labs we have worked with have given us the default suite2p output, so we built a converter for it. We "read an old format and write a new one" all the time, reading from a myriad of different formats and writing to NWB. It's something we specialize in, and I don't see how that is an indication of any deep problem. The nice thing about this approach is that it currently allows you to target a pre-existing NWB file and append to it. I realize this solution is not ideal for your situation, where you would prefer to write directly to NWB, but usually the file sizes for image segmentation results aren't too bad, so I thought you might find it helpful.

suite2p has also developed a straight-to-NWB approach, but without any way to add metadata other than what is already in the suite2p output, and that does not include a subject ID. You could always add this yourself afterwards with a few lines of code. It also looks like you might not be able to point the save function at an existing file and append to it. It would be nice if these features were added.

I mean the input NWB file to Suite2p. Might a better solution be to write/append the output data directly to the incoming NWB file. This feels like it would work for many use cases. Is it in the spirit of how NWB is imagined?

Yes, that is in the spirit of how NWB is imagined. I think it's worth raising an issue in suite2p, and we can put someone on it.

subject info: Why are you trying to change the files? What types of metadata about the subject do you not have when the NWB file is created?

The problem with external links is that, as they currently stand, they certainly would not function in DANDI's streaming access mode, because S3 destroys the relative path that the external link used, but maybe that's an OK sacrifice. I think the best solution would be for users to use external links and then, when they upload to DANDI, either:

a. come up with an S3-compatible way to link information across files, or
b. automatically turn external links into duplicated data on upload.

but right now we are trying to figure out our policy on external links and we don't have either of those solutions implemented.

bendichter · Oct 28 '21

I will update Suite2p to write the output directly to the incoming NWB. I already have an issue open with them, and asked if this would be an acceptable change.

Suite2p does not have a straight-to-NWB approach -- their code writes data in their own format and then rewrites it into NWB. It's missing the Subject info, which is why I sought advice on how to do this.

And yes, I have encountered the fact that DANDI munges filenames, and it seems the external linking stuff is messed up. In fact, Suite2p sets an empty external filename when using NWB as input, so that point is moot.

chrisroat · Oct 28 '21