mne-python
mne-python copied to clipboard
Add Support for NWB
Hi,
I've been working on creating a new subclass of BaseRaw
that can read from a NWB file. These are organized hdf files that can store multiple types of data streams including those with different sampling rates, sizes from different sources (such as eye-tracking, seeg, scalp eeg, audio etc.). Because they're hdf files they have lazy loading. It would be nice to have a RawNWB
object and/or a read_raw_nwb()
function so that MNE can work with these types of files. Attached is a link to a gist I made with some example code for reading/writing NWB files (Part 1) and some code snippets I've made to get started with creating a RawNWB
object (Part 2). The thing I mostly struggling with is creating the read_segment_file()
function. Any insight or help would be greatly appreciated.
https://gist.github.com/sportnoah14/b74f1226c09bd79aa7f7ec7b11af80c9
Hello! 👋 Thanks for opening your first issue here! ❤️ We will try to get back to you soon. 🚴🏽♂️
There is also an extension for NWB-NIRS that would be nice to support if this is implemented. See https://github.com/agencyenterprise/ndx-nirs Also ping @dsleiter
@rob-luke if that extension is a subclass of a TimeSeries
then it should work the same. It would be hard to encompass all the extensions. I think just having a single reader for a TimeSeries
container is a great start
ndx-nirs
stores NIRS data in a class that inherits from TimeSeries
, so it should be possible to read the raw data. However, to read channel metadata, I think RawNWB
might need to know about specific types of timeseries (eg. ElectricalSeries
, SpikeEventSeries
, NIRSSeries
, etc).
I'm not familiar with the usual implementation details of _read_segment_file
for other Raw
classes, but all pynwb timeseries objects should have a .data
attribute for access to the raw data.
https://nwb-schema.readthedocs.io/en/latest/format.html#timeseries
Would the following help in extracting the desired data?
timeseries = nwb.get_acquisition('timeseries_name')
timeseries.data[start:stop+1, idx]
Or was the difficulty with something else about the nwb files?
@dsleiter I was just meant that support for all TimeSeries
subclasses may be difficult. Not impossible though. I've already worked out a way to extract channel information for an ElectricalSeries
object. RawNWB
may just an up being an object with multiple if
statements in its declaration (or a function it calls in its declaration). The biggest issue for this may be if there's significant differences (such as attributes added, deprecated or changed) between versions of NWB. But we can figure that out after getting an initial version running.
Yeah, I think that separate logic might be needed for each TimeSeries
subclass in order to get channel information. That sounds good to start with ElectricalSeries
and add support for others later. There probably aren't too many cases that would need to be implemented in order to support the majority of use cases for mne users.
I believe the NWB team has been pretty good about ensuring backwards compatibility within major versions, so I'd expect the pyNWB 2.0 API to be pretty stable at least until version 3.0. It might be worth adding a check in the code to verify the major version of the installed pynwb
library and raise a warning or error if != 2.
Due to NWB's flexibility, other challenges might be handling situations with multiple acquisitions within a single file or ensuring a consistent label for ElectricalSeries
channels across different user-created files (looks like you explicitly add a 'label' column, which might not exist for most NWB files).
@dsleiter I agree, we would probably need separate logic for each case. If it's restricted to the TimeSeries
subclasses that are part of primary pynwb
package and the major extensions then that should be fine.
Beecause there can be many acquisitions, my thinking is that the RawNWB
object would interact with a single TimeSeries
as opposed to the whole file. So the function would work more like this:
from pynwb import NWBHDF5IO
io = NWBHDF5IO('myfile.nwb')
nwb = io.read()
raw = RawNWB(nwb.get_acquisition('eeg'))
For other specifications maybe a bunch of kwargs
can be added such as ch_names
as an argument that points to the column in the ElectrodeTable
object that specifies a name for a channel. Otherwise it can simply name channels as "chn1, chn2, chn3" like it would for another TimeSeries
object.
Before any of this though we need to get the _read_segment_file()
function working to support lazy loading. This is the part that I haven't figured out yet and am still working on. Once this, the most essential, aspect is done then we can go crazy with supporting all the subclasses of TimeSeries
@sportnoah14 still interested in this?
Before any of this though we need to get the _read_segment_file() function working to support lazy loading. This is the part that I haven't figured out yet and am still working on. Once this, the most essential, aspect is done then we can go crazy with supporting all the subclasses of TimeSeries
FWIW a first version doesn't absolutely have to have this. You can just always preload data (then _read_segment_file
does not need to be defined at all). It's not ideal but it's a good start, and might be good enough forever if these files are never expected to be very large...
@larsoner im still interested in this. I could start with having data preloaded but ideally I, and I think everyone using NWB files, would like to have lazy loading as these files can get very large. I'm currently on holiday and in the midst of moving but I'll be resuming this once I settle a bit more. Hopefully this coming week or the week after.
@dsleiter and @rob-luke I'd love your help with extensions once I get an initial version for preloaded data working for TimeSeries
objects. Maybe some of the core NWB developers would be interested in helping too.