[Feature]: Allow `overwrite_ok` argument to all add_X methods

Open jonahpearl opened this issue 3 years ago • 7 comments

What would you like to see added to PyNWB?

When creating a new processing pipeline, one often runs through it iteratively, fixing bugs along the way. Since NWB files cannot be overwritten in place, this means you either have to 1) delete the NWB file each time (slow), or 2) write each step so that it checks whether the step has already been completed and skips it if so. However, this still fails in the case where, say, step 3 has already completed and saved its data, but an error in the math (for example) causes step 4 to fail. To re-run step 3, one currently has to delete the entire NWB file and start over. This is quite inefficient.

Proposal: every .add_X() method (e.g., interface.add_timeseries()) should have an overwrite_ok option that is False by default. If False, the behavior is as it is now. But if the user passes True, then on the back end, only the .data attributes of the object are changed. This allows a primitive "overwriting" of already-existing objects in NWB files.

For example, for adding a timeseries to an interface:

def add_timeseries(self, ts, overwrite_ok=False):
    if ts.name in self.time_series and overwrite_ok:
        # overwrite only the values of the existing dataset in place
        self[ts.name].data[:] = ts.data
    elif ts.name in self.time_series and not overwrite_ok:
        raise ValueError('Time series exists but overwrite not ok!')
    else:
        self._add_timeseries(ts)  # _add_timeseries is an internal function that mirrors the current add_timeseries behavior

I've written this as a function for myself to generate this behavior, but it seems like something quite desirable for developing new pipelines, especially ones where re-processing data is time intensive and people don't want to dig around in the H5 files to manually delete things.

Is your feature request related to a problem?

No

What solution would you like?

See above.

Do you have any interest in helping implement the feature?

Yes, but I would need guidance.

Code of Conduct

jonahpearl avatar Aug 11 '22 19:08 jonahpearl

Thank you @jonahpearl. This is a great idea. Many of the add_x (and create_x) functions come from hdmf.container.MultiContainerInterface: https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/container.py#L730 https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/container.py#L761

It would be great if you could submit a pull request to suggest changes to that code in HDMF (which is then used in PyNWB).

These functions use docval to generate the function signature and keyword arguments allowed. It might be a little complicated to understand, but feel free to start a draft PR and ask how to do something in the PR, and we'll be happy to help.
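For orientation, here is a rough, hypothetical sketch of what a docval-decorated add method looks like, with an extra overwrite_ok entry showing where a new keyword argument with a default could be declared (this argument does not exist in HDMF or PyNWB today, and the generated methods in MultiContainerInterface are built dynamically rather than written out like this):

from hdmf.utils import docval, popargs
from pynwb import TimeSeries

class ExampleInterface:
    @docval({'name': 'ts', 'type': TimeSeries, 'doc': 'the TimeSeries to add'},
            {'name': 'overwrite_ok', 'type': bool, 'default': False,
             'doc': 'if True, replace the data of an existing object with the same name'})
    def add_timeseries(self, **kwargs):
        # popargs pulls the validated values out of kwargs in order
        ts, overwrite_ok = popargs('ts', 'overwrite_ok', kwargs)
        ...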

rly avatar Aug 11 '22 21:08 rly

For this sort of iterative approach, where you need to repeatedly update or remove data, I think a more modular approach to storing the data from your pipeline could help (see https://pynwb.readthedocs.io/en/stable/tutorials/advanced_io/linking_data.html). That is, instead of writing steps 1, 2, 3, etc. of your pipeline to a single NWB file, you could write each step to a separate NWB file and then, to make them look like a single file, create a light-weight main NWB file that contains external links to the appropriate data in the other files (a rough sketch follows the list below). This approach would allow you to:

  • Remove the data from a particular step by just removing the NWB file that corresponds to that step. Depending on whether you are then regenerating the file in-place or not, you may (or may not) need to regenerate the main NWB file with the external links (however, since that file won't contain any data the cost for regenerating that file should be minimal)
  • Avoid repeated updates to files by keeping updates contained
  • Interact with the data as if it were in a single NWB file via the main file with the external links
  • Using NWBHDF5IO.export you can then also copy all the data to a single file at the end once you are happy with the results
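In code, the approach from the linking tutorial looks roughly like this (file names, module names, and object names here are made up, so the details will need adjusting):

from datetime import datetime
from dateutil.tz import tzlocal
from pynwb import NWBFile, NWBHDF5IO, get_manager

# Read the file produced by one pipeline step, sharing a BuildManager so that
# the write below produces external links instead of copying the data.
manager = get_manager()
step_io = NWBHDF5IO('step3_results.nwb', mode='r', manager=manager)
step_nwb = step_io.read()
step3_ts = step_nwb.processing['behavior']['step3_output']

# Build a light-weight "main" file that links to the data in the step file.
main = NWBFile(session_description='main file with external links',
               identifier='main-001',
               session_start_time=datetime.now(tzlocal()))
main.add_acquisition(step3_ts)

with NWBHDF5IO('main.nwb', mode='w', manager=manager) as io:
    io.write(main)
step_io.close()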

Proposal: every .add_X() method (e.g., interface.add_timeseries()) should have an overwrite_ok option that is False by default.

The concept of allowing overwrite is interesting; however, a word of caution. Implementing this behavior in general will, I think, require a substantial amount of effort, because it will require handling many edge cases. For a general overwrite_ok, one would need to overwrite not just TimeSeries.data but all datasets and attributes of any neurodata_type (which will be tricky to do, in particular with extensions). Even for the specific case of TimeSeries, one would need to overwrite not just TimeSeries.data but also TimeSeries.timestamps and any other datasets and attributes, in order to prevent creation of invalid data (e.g., due to a mismatch between data and timestamps). In addition, there is the issue that the data provided may have a different shape or even dtype. Changing shape is possible in some cases (if chunking is enabled), but changing dtype is not. In your particular case, it looks like you may be able to assume that the shape and dtype of the data do not change and that timestamps and any other attributes stay the same as well. However, these are very specific assumptions.

Implementing this on hdmf.container.MultiContainerInterface makes sense. Given the scope of the problem, and for clarity of the API, I think it would be more explicit to implement something like this via new update_x functions instead of expanding the scope of add_x.

If addressing the more general issue is too complicated, then creating a separate utility function (that would take the nwbfile and the timeseries object as input) may be more appropriate, as this would make it easier to limit the scope of the function to a more specific problem.

oruebel avatar Aug 11 '22 22:08 oruebel

I think I misunderstood the original issue. @jonahpearl are you creating / loading an NWB file and then adding to it / updating it for each step of processing? Or are you creating an NWBFile in-memory, adding to it / updating it for each step of processing, and then writing it once to disk? If the former, then yes, this can get pretty complicated to do for the reasons that @oruebel mentions.

Instead of allowing overwrite, what if we added an argument skip_duplicates (default False), and when it is set and an object with the same name already exists in the container, do nothing instead of raising an error. Would this be useful to you?
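Roughly, and purely hypothetically, the generated add method could then behave like this (mirroring the snippet in the issue description, including its internal _add_timeseries helper):

def add_timeseries(self, ts, skip_duplicates=False):
    if ts.name in self.time_series:
        if skip_duplicates:
            return self.time_series[ts.name]  # keep the existing object, do nothing
        raise ValueError("Time series '%s' already exists" % ts.name)
    self._add_timeseries(ts)  # same internal add as in the snippet above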

rly avatar Aug 11 '22 22:08 rly

Yeah, it's the former. @oruebel's suggestion is good in general, but I don't really want to have a separate NWB file for each analysis function in my data! I've basically implemented skip_duplicates manually by checking, in each analysis function, whether the data already exists. The issue is that if I've already generated my_results_from_analysis_3 and need to re-generate it because it was actually wrong, then with a default NWB setup you currently can't.

I like the idea of having an update_X, because then you can implement checks like checking dtypes, checking sizes, etc. Is that implementable in the HDMF docval structure, or is it too specific to NWB?

(And re-upping a question I tried to send via email but that didn't seem to go through: Do all HDMF containers have a .data attribute? Feels like it would almost be too abstract to just add what I wrote above into that very high-level function factory, but I can certainly give it a try. Do you have a dummy implementation of creating a class with the HDMF docval, so that I can easily test it? Or is there a function / script somewhere in the NWB library that will use it to make a time series class?)

jonahpearl avatar Aug 11 '22 22:08 jonahpearl

Do all HDMF containers have a .data attribute?

No, that is not the case. The attributes depend on the schema and how they are mapped to class attributes. E.g., a DynamicTable type will typically not have a .data attribute (unless, in the specific case, it contains a column that is named data).

I don't really want to have a separate NWB file for each analysis function in my data!

I'd be interested in hearing more about what your concerns with a more modular organization are? I'm just curious, because I would like to better understand your use case and the issues that a modular organization of the data may cause for you.

Feels like it would almost be too abstract to just add what I wrote above into that very high-level function factory,

For the more general case, I agree that the implementation would need to expand significantly from the example you provided in order to cover the various relevant cases.

if I've already generated my_results_from_analysis_3 and need to re-generate it because it was actually wrong, then with a default NWB setup you currently can't.

A key challenge here is that once a file has hit the disk, there are certain permanent states that cannot be modified in place. E.g., if an analysis, such as a feature detector (e.g., spike detection or ROI detection), returns a different number of features, then one would need to change the shape of the data (and timestamps). This, however, is only possible if chunking and maxshape have been set accordingly for those datasets. Other properties, e.g., the dtype of a dataset, cannot be altered at all after write.
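For reference, a minimal sketch of how a dataset can be made resizable at write time by wrapping it with H5DataIO (the TimeSeries name and parameters here are made up):

import numpy as np
from hdmf.backends.hdf5.h5_utils import H5DataIO
from pynwb import TimeSeries

# chunks=True plus a maxshape entry of None allows the first dimension to grow later
wrapped = H5DataIO(data=np.zeros((100,)), chunks=True, maxshape=(None,))
ts = TimeSeries(name='step3_output', data=wrapped, unit='a.u.',
                starting_time=0.0, rate=30.0)
# After writing and re-opening the file in 'a' mode, the underlying h5py dataset
# can be resized, e.g. ts.data.resize((200,)), and the new values written.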

You can, in principle, delete whole groups and datasets from an HDF5 file (e.g., h5py supports the del syntax: https://docs.h5py.org/en/stable/high/group.html#dict-interface-and-links). In this way you could remove the object from the file first and then add it back as a new object. However, a word of caution: deleting objects from HDF5 files only removes the object from the file metadata and declares the space as available; it does not resize the file. HDF5 will then reuse the freed space to store new objects if possible. This has two consequences: 1) after you delete an object, the size of the file will not change, and 2) if the data you add next does not fit nicely into the space that was freed, you can end up with "fragmented" files. In your particular case this may not be an issue, since the data you are replacing appears to have the same size, but I just wanted to mention it.
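A minimal sketch of such a deletion with h5py (the path inside the file is made up):

import h5py

with h5py.File('session.nwb', 'a') as f:
    # removes the object from the file's metadata; the space is reused, not reclaimed
    del f['processing/behavior/step3_output']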

For your particular case, where you just want to update the values in the TimeSeries.data dataset, I think the approach of implementing this as a utility function may be simplest.

oruebel avatar Aug 11 '22 23:08 oruebel

Thank you for your detailed response. This helps me understand the pitfalls of overwriting data in HDMF.

For your particular case, where you just want to update the values in the TimeSeries.data dataset, I think the approach of implementing this as a utility function may be simplest.

Agreed -- I'll add a check to the utility function to make sure the size / type of the data aren't changing.
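For example, something along these lines (a minimal sketch; the function name, and the assumption that the file is open in 'a' mode and only the values change, are mine):

import numpy as np

def overwrite_timeseries_data(nwbfile, module_name, ts_name, new_data):
    """Replace the values of an existing TimeSeries.data dataset in place."""
    ts = nwbfile.processing[module_name][ts_name]
    new_data = np.asarray(new_data)
    if new_data.shape != ts.data.shape:
        raise ValueError('shape mismatch: %s vs %s' % (new_data.shape, ts.data.shape))
    if new_data.dtype != ts.data.dtype:
        raise ValueError('dtype mismatch: %s vs %s' % (new_data.dtype, ts.data.dtype))
    ts.data[:] = new_data  # write into the existing HDF5 dataset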

I'd be interested in hearing more about what your concerns with a more modular organization are? I'm just curious, because I would like to better understand your use case and the issues that a modular organization of the data may cause for you.

For now, practically, I'm concerned it will take some time to implement and will add another layer of abstraction to what is already a fairly complex system. In principle, I am amenable to it, but it does conflict with my internal idea of NWB files as "all in one" files. Maybe I just need to re-jigger my internal idea to be more like "a very flexible container that can hold whatever you need it to."

jonahpearl avatar Aug 12 '22 02:08 jonahpearl
