
Collaborating with Exdir

Open jakirkham opened this issue 6 years ago • 4 comments

The Exdir file format, which is also discussed in this paper, has some similarities with Zarr's directory storage format. Would be interesting to learn more about the similarities and differences as well as see if there is an opportunity to collaborate and grow from the effort.

jakirkham • Nov 16 '18 04:11

Any thoughts on this @dragly and @lepmik?

jakirkham • Nov 26 '18 17:11

@jakirkham That would be very interesting! Thanks for reaching out. I had a brief look at zarr some time ago and noticed some similarities, but didn't have the time to dig into the details. Give us a bit of time to get a better overview and we'll get back with some thoughts on how we could progress.

dragly • Dec 02 '18 22:12

Sorry it took so long to get back to you about this.

I took a closer look now and see that there are definitely a lot of similarities between Exdir and the Zarr directory storage format.

I think the main differences between Exdir and Zarr can be summarized as:

  • Exdir using YAML instead of JSON for metadata.
  • Exdir using the simple NumPy file format to store data, without support for chunks or compression.
  • Zarr having direct support for multiple storage backends, including database servers and binary archives like Zip.
  • Zarr having support for variable length strings (there is a PR for using Feather as the underlying data format for this purpose in Exdir, but it is still not merged).
  • Zarr having a much wider community and more traction. This is really nice to see, by the way. I hope an alternative to HDF5 will get adopted by the wider scientific community over the years.

I think both projects nicely solve many of the issues we had with using HDF5 in our lab. Both Zarr and Exdir are great improvements over HDF5 for the particular issues we had.

I still like the simplicity of using pure NumPy files for storage in Exdir. We have also had a good experience with YAML files, since they are slightly easier to edit and understand, especially for people without a programming background. However, the question remains whether these arguments are important enough to others to warrant two storage backends that are so similar, yet slightly different.

If so, would it be interesting to have an Exdir storage backend for Zarr? And should it then

  1. adhere to the Zarr specification and
    • store the data as bytes in several NumPy files (per chunk) and
    • metadata in the exdir.yaml file, or
  2. adhere to the Exdir specification and
    • store the data as the correct data type in a flat NumPy file and
    • ignore chunking (put everything in one file)
    • fall back to storing bytes+metadata for other data types (objects, compression, etc.)
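To make the trade-off between the two options concrete, here is a minimal, dependency-free sketch of the two layouts. It is not the actual Zarr or Exdir on-disk format: raw little-endian float64 bytes stand in for the NumPy files, a JSON file stands in for `exdir.yaml` (the standard library has no YAML parser), and the names `write_chunked`, `read_chunked`, `write_flat`, `meta.json` and `data.bin` are all hypothetical.

```python
import json
import struct
from pathlib import Path


def write_chunked(root, values, chunk_len):
    """Option 1 style: one file per chunk, plus a small metadata file."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    meta = {"shape": [len(values)], "chunks": [chunk_len], "dtype": "<f8"}
    (root / "meta.json").write_text(json.dumps(meta))
    for start in range(0, len(values), chunk_len):
        chunk = values[start:start + chunk_len]
        # Chunk files are named by their index: "0", "1", ...
        payload = struct.pack(f"<{len(chunk)}d", *chunk)
        (root / str(start // chunk_len)).write_bytes(payload)
    return meta


def read_chunked(root, start, stop):
    """Read values[start:stop] (start < stop), opening only the chunks needed."""
    root = Path(root)
    meta = json.loads((root / "meta.json").read_text())
    chunk_len = meta["chunks"][0]
    out, touched = [], []
    for idx in range(start // chunk_len, (stop - 1) // chunk_len + 1):
        raw = (root / str(idx)).read_bytes()
        chunk = struct.unpack(f"<{len(raw) // 8}d", raw)
        lo = max(start - idx * chunk_len, 0)
        hi = min(stop - idx * chunk_len, len(chunk))
        out.extend(chunk[lo:hi])
        touched.append(idx)
    return out, touched


def write_flat(root, values):
    """Option 2 style: ignore chunking and put everything in one file."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    (root / "data.bin").write_bytes(struct.pack(f"<{len(values)}d", *values))


def read_flat(root):
    """Read the whole flat file back; partial reads would need seeking."""
    raw = (Path(root) / "data.bin").read_bytes()
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))
```

With the chunked layout, a slice only opens the chunk files it overlaps, at the cost of needing the metadata to locate them. The flat layout stays readable as a single typed file without any metadata, which is closer to what Exdir does today.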

Another alternative is to embrace the Zarr directory storage and provide a conversion path for existing Exdir users.

Any thoughts?

dragly • Feb 25 '21 12:02

I've just written an implementation in OCaml for storing Exdir's internal data files. The implementation was quite straightforward (I already had an implementation of the npy format), but handling YAML is complicated (there is an OCaml binding, but it sometimes segfaults).

I do believe that YAML was not a good choice for metadata, because it is hard for many languages to come up with their own YAML implementation. Thanks for your work on Exdir anyway!

Best,

vbmithr • Apr 29 '21 14:04