
Docs about updating or adding data

Open matham opened this issue 7 years ago • 3 comments

I'm looking at pynwb and how I can use it in our pipeline, but some of the data manipulation aspects aren't clear to me as a user:

  • Can I take advantage of the fact that we're dealing with HDF5 files and directly manipulate the file, or do I always have to go through the pynwb wrapper?
  • Specifically, the docs show how to create a file and add data, but how do I stream live data to the file? E.g. say I create a TimeSeries representing ephys data, how can I keep appending data (efficiently) to the TimeSeries? h5py lets you do this with resizable datasets: http://docs.h5py.org/en/latest/faq.html#appending-data-to-a-dataset.
  • Say I have the file and then call io.write(f). Then I add more data and call it again. Would it re-write the whole file? What about huge files? I guess it isn't clear to me how thin the pynwb wrapper is on top of h5py, or whether it's completely separate.
  • It doesn't let me change, e.g., the description of an ElectricalSeries once created. In a typical experiment I'd create a bunch of data, write down the conditions for each run in a notepad, and later fill all that info into the file. But if it's write-once, that wouldn't work. I suppose I could create the data live and later import it into a new file and add all this info then, but that may create data duplication, which is probably unwanted. Although I suppose it may be best not to edit the original data at all once created, but rather link to it.

Thanks in advance!

matham avatar Dec 20 '17 20:12 matham

In the order that you asked your questions:

  • Yes. PyNWB is just writing HDF5 files... there is nothing about these files that prevents you from opening them up with h5py or any other HDF5 bindings.
  • There is some support for writing streams. Unfortunately the documentation is a bit lacking. The key data structure is the DataChunkIterator. You would wrap up your stream object with a DataChunkIterator and pass it in place of a list/ndarray/DataFrame/etc. (see the sketch after this list). The stream will not be read from until io.write(f) is called on the parent file. Also, these streams will be pulled from serially. As a disclaimer, this is an advanced feature that we haven't had the chance to test in the real world outside of converting datasets that are too large to fit into memory. I assume in your case you would want to write data from multiple streams concurrently? We would appreciate any input you have on this. @oruebel may also have more to say
  • So long as you are using the same FORMIO object OR you use the same BuildManager when creating different FORMIO objects, you can call io.write(f) multiple times without rewriting data that has already been written.
  • As of now, there is no editing of data after it is written. You can add new containers (i.e., TimeSeries, ProcessingModules, etc.), but you can't edit existing ones. We're not necessarily opposed to a feature like this, but we haven't had any requests for it yet, and we want to make sure we get it right before taking it on.
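
Roughly, usage looks like the following; treat it as a sketch only, since import paths and the exact required constructor arguments vary across pynwb versions.

```python
import numpy as np
from datetime import datetime
from pynwb import NWBFile, TimeSeries, NWBHDF5IO
from pynwb.form.data_utils import DataChunkIterator  # module location may change in future releases

def sample_stream(n_samples=10000):
    """Stand-in for a live acquisition source, yielding one sample at a time."""
    for _ in range(n_samples):
        yield np.random.rand()

# The generator is wrapped in a DataChunkIterator and passed wherever an array
# would normally go; buffer_size controls how many samples are grouped into one
# chunk when the data is eventually pulled.
data = DataChunkIterator(data=sample_stream(), maxshape=(None,),
                         dtype=np.dtype('float64'), buffer_size=1000)

# Required constructor arguments differ between pynwb versions (older releases
# also take a `source` argument), so these calls are illustrative only.
nwbfile = NWBFile('streaming demo', 'demo-id', datetime.now())
ts = TimeSeries(name='ephys_stream', data=data, unit='V', rate=30000.0)
nwbfile.add_acquisition(ts)

# Nothing is read from the generator until write() is called.
with NWBHDF5IO('streamed.nwb', mode='w') as io:
    io.write(nwbfile)
```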

Please feel free to join the Slack channel

ajtritt avatar Dec 20 '17 21:12 ajtritt

I have not done any particular performance tests with streaming data yet, so please let us know in case you run into any issues with using the DataChunkIterator.

Re @ajtritt's comment regarding multiple streams: while you can have multiple DataChunkIterators, they are currently processed on the backend one at a time, i.e., you cannot yet have multiple simultaneous data streams into a single file. Doing multiple streams into a single file (whether HDF5 or some other format) seems problematic in general, at the very least from a performance perspective, but it can also create other problems (e.g., fragmentation of the file, where chunks from different datasets become intermixed and "scattered").

oruebel avatar Dec 21 '17 23:12 oruebel

Thanks for the answers.

  • First, regarding data editing after the fact: that is not as important, since I can do the data linking as mentioned (although I'm not sure whether linking requires that all attributes match?).
  • Regarding calling io.write(f) multiple times: I tried with HDF5IO mode w and it just overwrote the file each time as far as I can tell. With mode a, it complained on the second write with ValueError: Unable to create group (name already exists). Do you have example code of multiple writes? My reading of your suggestion is sketched after this list.
  • So yeah, I guess DataChunkIterator is kind of what I want, and I do like the buffering, except I do have multiple (independent) streams being written to the same file, but also the data is not all available when write is called. I see the discussion at https://github.com/NeurodataWithoutBorders/pynwb/issues/81 and https://github.com/NeurodataWithoutBorders/pynwb/issues/14.
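
For reference, this is how I'm reading the shared-BuildManager suggestion. I haven't gotten this to work as written, and get_manager plus the exact NWBHDF5IO keyword arguments are guesses on my part; nwbfile is an NWBFile that has already been populated (e.g. as in your sketch above).

```python
# My reading of the multiple-write suggestion (untested as written; get_manager
# and the NWBHDF5IO keywords are guesses on my part).
from pynwb import NWBHDF5IO, get_manager

manager = get_manager()  # the same BuildManager is reused for every IO object

with NWBHDF5IO('session.nwb', mode='w', manager=manager) as io:
    io.write(nwbfile)  # initial write of an existing, populated NWBFile

# ... acquire more data, add new containers to the same nwbfile object ...

with NWBHDF5IO('session.nwb', mode='a', manager=manager) as io:
    io.write(nwbfile)  # ideally only the new containers get written
```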

As an alternative approach, and why I asked question (1): if I can use pynwb to create all the NWB data structures in the HDF5 file, then I can open the file with h5py and just append data for each stream independently, with pynwb perhaps returning the h5py Dataset from the Container. This wouldn't work if pynwb saves the dataset size somewhere, unless I also know to update it. Besides being dirty and doing low-level stuff, is this a reasonable approach? Perhaps it can be abstracted, and I think I can make it work with DataChunkIterator.
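
Concretely, the kind of thing I have in mind is below. The dataset path is a guess at where pynwb puts acquisition TimeSeries data, and it only works if the dataset was created chunked with an unlimited first dimension in the first place.

```python
# Hypothetical helper for appending to a dataset inside an NWB/HDF5 file that
# pynwb has already laid out. The path '/acquisition/ephys_stream/data' is a
# guess, and the dataset must have been created resizable (chunked,
# maxshape=(None, ...)) for resize() to succeed.
import h5py
import numpy as np

def append_samples(filename, dataset_path, new_samples):
    """Grow a 1-D resizable dataset in place and write the new samples."""
    with h5py.File(filename, 'a') as f:
        dset = f[dataset_path]
        old_len = dset.shape[0]
        dset.resize((old_len + len(new_samples),))
        dset[old_len:] = new_samples

append_samples('session.nwb', '/acquisition/ephys_stream/data',
               np.random.rand(1000))
```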

Check me if I understand the DataChunkIterator. Normally, a user has all the data by the time write is called. DataChunkIterator is useful when the data comes in chunks after the NWBFile is created but before it is written to disk. So when we write to disk, all the data had better be there, otherwise __chunked_iter_fill__ gets stuck waiting on data for that stream. What I want is to be able to call write many times, with each call writing the currently (newly) available data. This isn't something DataChunkIterator currently seems able to handle. The reason is to make sure data is saved periodically, rather than accumulated in memory during an experiment.

So in my proof-of-concept PR (https://github.com/NeurodataWithoutBorders/pynwb/pull/310), instead of DataChunkIterator, the simplest abstraction is a DataChunkStream that is passed as the data; then, at each successive write, we just write all the data currently in the queue for each dataset. This way I can keep appending data and have it written, e.g., every 5 minutes, in a manner independent of the low-level HDF5 details. I even added buffering.
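
Roughly, the shape of the thing is below; the actual class in the PR differs in the details, so treat this as an illustration only.

```python
# Very rough sketch of the DataChunkStream idea: data is appended to an
# in-memory queue, and each write drains whatever has accumulated so far.
# The class name and method names here are illustrative, not the PR's API.
from collections import deque
import numpy as np

class DataChunkStream:
    """Queue-backed data source that can be drained repeatedly."""

    def __init__(self):
        self._queue = deque()
        self._written = 0  # number of samples already flushed to disk

    def append(self, samples):
        """Called from the acquisition side as data arrives."""
        self._queue.append(np.asarray(samples))

    def drain(self):
        """Yield (start_index, samples) for everything currently queued."""
        while self._queue:
            samples = self._queue.popleft()
            yield self._written, samples
            self._written += len(samples)
```

Each call to write would then pull from drain() and extend the on-disk dataset only by the amount that is new since the previous write.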

I did have to change things everywhere to use AbstractDataChunkIterator, rather than (inheriting from) DataChunkIterator, which has a lot of extra stuff, and because of the issues explained below. Given that DataChunkIterator is fairly specific, unless it's only meant as an example implementation, the appropriate isinstance check should be against AbstractDataChunkIterator. And the way the HDF5 io makes use of DataChunkIterator's specific attributes should probably be changed. E.g. it uses the dtype attribute of DataChunkIterator directly, but AbstractDataChunkIterator doesn't have this attribute; if it is to be used in the formio, it needs to be added to AbstractDataChunkIterator, like e.g. recommended_chunk_shape.

From an abstraction POV, the io resizes the array based on the largest index in the DataChunk's selection slice (https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/form/backends/hdf5/h5tools.py#L562). So e.g. if I'm buffering at 100 values and I pass DataChunkIterator an array of 10,000 values, it'll do 100 resizes, one for each chunk, unless we pass it the last chunk (with the largest indices) first, like I did in my stream to fix this. Basically, the problem is that we're indirectly causing a resize based on the selection values. Instead, AbstractDataChunkIterator should have an attribute, e.g. data_size, which we check when reading every chunk; if data_size is bigger than the formio array, the array gets resized. This makes it explicit and easier to control. I can implement this if you like.
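
To make the proposal concrete: data_size is the proposed attribute, and write_chunks below is just a stand-in for the write loop in h5tools, not actual pynwb code.

```python
# Illustration of the explicit-resize proposal (1-D case). Each DataChunk is
# assumed to carry .data and .selection; data_size is the proposed attribute.
def write_chunks(dset, chunk_iter):
    for chunk in chunk_iter:
        required = getattr(chunk_iter, 'data_size', None)  # proposed attribute
        if required is not None and required > dset.shape[0]:
            dset.resize((required,))           # one explicit, predictable resize
        dset[chunk.selection] = chunk.data     # no implicit resize per chunk
```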

Also, it seems like DataChunkIterator copies every single item when buffering (https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/form/data_utils.py#L143), which should cause a performance hit. That's why I decided to use a minimum buffer: once it's exceeded, everything gets written, which avoids having to split/copy data. I also chose to only accept single-dimension appending for DataChunkStream. Similarly for the data, it accepts lists or arrays etc., not iterators, because otherwise everything just gets more complicated, and people can roll their own if they need to.


I really liked your async suggestion as well; since dropping py2 support I have really gotten into async and found trio to be amazing. I have started converting all my hardware data sources to use async. But given the example above, I'm not sure it's needed here. If you did want to use it, you'd probably need to make write an async method which internally calls write_builder followed by write_plumb_data, where write_plumb_data would be an async method that continuously polls all the AsyncAbstractDataChunkIterators so that none of them block. It'd be an interesting thing to do.
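
Something along these lines is what I picture; none of these names (write_plumb_data, AsyncAbstractDataChunkIterator, write_chunk) exist in pynwb, so this is purely a thought experiment.

```python
# Hypothetical trio-based plumbing: drain several async chunk sources
# concurrently so no single stream blocks the others.
import trio

async def drain_stream(stream, write_chunk):
    """Consume one async chunk source, handing each chunk to the writer."""
    async for chunk in stream:        # stream: an async iterator of chunks
        write_chunk(chunk)            # synchronous write of one chunk
        await trio.sleep(0)           # checkpoint so the other streams can run

async def write_plumb_data(streams, write_chunk):
    """Drain all streams concurrently; returns once every stream is exhausted."""
    async with trio.open_nursery() as nursery:
        for stream in streams:
            nursery.start_soon(drain_stream, stream, write_chunk)

# e.g. trio.run(write_plumb_data, streams, write_chunk)
```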

matham avatar Dec 22 '17 02:12 matham