
Support extendable datasets

Open tannerbitz opened this issue 2 years ago • 3 comments

tannerbitz avatar Oct 05 '21 18:10 tannerbitz

Thanks for opening this, @tannerbitz. Copying the relevant part from the emails:

> I have a question regarding the use of your z5 library as an alternative to hdf5. Is there a way to extend a dataset during runtime, in my case for logging? Or possibly to link datasets together? Any help would be greatly appreciated! I'm looking to use both languages: C++ on the embedded device, and Python for later data analysis.
> I'm open to using either data format, n5 or zarr. Based on your docs, it looks like I may be limited to little-endian format for zarr, so maybe I should lean towards n5, as I'm not sure whether the embedded devices in question will be natively little- or big-endian?

In theory, both extending and linking datasets are possible (and not very complex, thanks to the simple layout of zarr / n5 files). However, neither is implemented natively in z5 yet (mostly because I am not sure what the best interface would be). There are different routes to implement this:

  • extending datasets:
    • simple: just make the dimensions large enough that your data will always fit; because of chunking, this does not incur any performance / storage downsides. If you need to know the actual dimensions, keep track of them in your code and serialize them in the attributes. Once you stop extending the dataset, you can also crop the dimensions to the actual size.
    • advanced: extend the shape in the attributes whenever a write goes beyond the current dimensions (more complex, because you may need to rewrite cropped border chunks)
  • linking datasets:
    • simple: load all the data into memory and write a new dataset
    • advanced: add up the shapes of the initial datasets and link or move the chunks (again, might need some extra treatment for cropped border chunks)
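
To make the "advanced" extending route a bit more concrete, here is a minimal sketch for the zarr v2 layout, written against the on-disk format with the Python standard library only; it does not use z5 and is not z5's API. It assumes a 1D, uncompressed, little-endian float64 array, where zarr v2 stores border chunks at full size, so growing the dataset only means writing chunk files and updating `shape` in `.zarray`. The names `create_log`, `append` and `CHUNK` are made up for this illustration. For n5, where border chunks are cropped, the last chunk would additionally need to be rewritten at its new size.

```python
import json
import struct
import tempfile
from pathlib import Path

CHUNK = 4  # samples per chunk (hypothetical value for this sketch)


def create_log(path: Path) -> None:
    """Create an empty 1D float64 zarr v2 array without compression."""
    path.mkdir(parents=True)
    meta = {
        "zarr_format": 2,
        "shape": [0],
        "chunks": [CHUNK],
        "dtype": "<f8",
        "compressor": None,
        "fill_value": 0.0,
        "order": "C",
        "filters": None,
    }
    (path / ".zarray").write_text(json.dumps(meta))


def append(path: Path, samples: list) -> int:
    """Append samples and grow the shape in .zarray. Returns the new length."""
    meta_file = path / ".zarray"
    meta = json.loads(meta_file.read_text())
    length = meta["shape"][0]
    for value in samples:
        chunk_id, offset = divmod(length, CHUNK)
        chunk_file = path / str(chunk_id)  # 1D chunk keys are "0", "1", ...
        # Border chunks are stored at full size in zarr v2, so a partly
        # filled chunk already has room for the new value.
        if chunk_file.exists():
            data = list(struct.unpack(f"<{CHUNK}d", chunk_file.read_bytes()))
        else:
            data = [meta["fill_value"]] * CHUNK
        data[offset] = value
        chunk_file.write_bytes(struct.pack(f"<{CHUNK}d", *data))
        length += 1
    # Only the metadata changes to make the new samples visible to readers.
    meta["shape"] = [length]
    meta_file.write_text(json.dumps(meta))
    return length


root = Path(tempfile.mkdtemp()) / "log.zarr"
create_log(root)
append(root, [1.0, 2.0, 3.0])
new_length = append(root, [4.0, 5.0, 6.0])
```

This rewrites the current chunk once per sample to keep the sketch short; a real logger would more likely buffer a full chunk in memory and write it in one go.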

I don't have much time to integrate any of this into the library right now, but I could quickly code up a prototype for any of these options in Python, and it should also be relatively easy to translate this to C++.

Let me know what you think.

constantinpape avatar Oct 05 '21 19:10 constantinpape

I forget, do the zarr/N5 specs support dataset reshaping? They'd handle the edge chunks quite differently.

Otherwise, on the application side you could set aside a metadata key like "followed_by" which tells you the address of the next array, and you could fill them up like rotating log files.
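
A rough sketch of how such a chain could be walked, assuming zarr v2, where user attributes live in a `.zattrs` JSON file next to each array (for n5 the file would be `attributes.json`). The key name "followed_by" is the one suggested above; the helper names and the `log_000` style array names are hypothetical, and only the standard library is used rather than z5.

```python
import json
import tempfile
from pathlib import Path


def write_attrs(array_dir: Path, attrs: dict) -> None:
    """Write user attributes for the array stored in array_dir."""
    array_dir.mkdir(parents=True, exist_ok=True)
    (array_dir / ".zattrs").write_text(json.dumps(attrs))


def read_attrs(array_dir: Path) -> dict:
    """Read user attributes; an array without .zattrs has none."""
    attrs_file = array_dir / ".zattrs"
    return json.loads(attrs_file.read_text()) if attrs_file.exists() else {}


def chain(root: Path, first: str) -> list:
    """Follow "followed_by" links from `first`; return array names in order."""
    names, seen, current = [], set(), first
    # The `seen` set guards against accidental cycles in the links.
    while current is not None and current not in seen:
        seen.add(current)
        names.append(current)
        current = read_attrs(root / current).get("followed_by")
    return names


root = Path(tempfile.mkdtemp())
write_attrs(root / "log_000", {"followed_by": "log_001"})
write_attrs(root / "log_001", {"followed_by": "log_002"})
write_attrs(root / "log_002", {})  # last array in the chain, no successor
order = chain(root, "log_000")
```

When the current array fills up, the writer would create the next one and set "followed_by" on the old array; a reader then concatenates the arrays in the order returned by `chain`.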

clbarnes avatar Oct 06 '21 08:10 clbarnes

> I forget, do the zarr/N5 specs support dataset reshaping? They'd handle the edge chunks quite differently.

I think there are some discussions going on in zarr, but there is no canonical way to do it yet.

> Otherwise, on the application side you could set aside a metadata key like "followed_by" which tells you the address of the next array, and you could fill them up like rotating log files.

Good point, that would also work.

constantinpape avatar Oct 06 '21 09:10 constantinpape