netcdf-c

Draft: Add mode to read consolidated ZARR datasets

Open mannreis opened this issue 1 year ago • 4 comments

This change adds a mode option, mode=consolidated, that fetches the .zmetadata file from the root of the dataset if one exists (it may be best to do this by default when reading, and fall back if it fails). That file could then serve as a unified representation whenever group or variable metadata is needed further down the code path.

This is a WIP motivated by #2987 and lacks (at least):

  • [ ] Unit testing
  • [ ] Functional testing
  • [ ] Support Zarr V3
  • [ ] Robustness when consolidated metadata is not available
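As a rough illustration of the read-with-fallback behaviour described above, here is a minimal Python sketch (not netcdf-c code). It assumes the Zarr v2 consolidated layout written by zarr-python, where .zmetadata is a JSON object whose "metadata" member maps keys such as "var/.zarray" to their JSON content. The `fetch` callable and both function names are hypothetical stand-ins for whatever the storage layer provides.

```python
import json
from typing import Callable, Optional

# Hypothetical storage hook: returns the bytes stored under a key,
# or None when the key does not exist (for example on an HTTP 404).
Fetch = Callable[[str], Optional[bytes]]


def load_consolidated(fetch: Fetch) -> Optional[dict]:
    """Return the key -> JSON map from .zmetadata, or None if unusable."""
    raw = fetch(".zmetadata")
    if raw is None:
        return None
    try:
        doc = json.loads(raw)
    except ValueError:  # malformed JSON or undecodable bytes
        return None
    entries = doc.get("metadata") if isinstance(doc, dict) else None
    return entries if isinstance(entries, dict) else None


def read_metadata(fetch: Fetch, key: str,
                  consolidated: Optional[dict] = None) -> Optional[dict]:
    """Prefer the consolidated copy, else fall back to the individual object."""
    if consolidated is not None and key in consolidated:
        return consolidated[key]
    raw = fetch(key)
    return json.loads(raw) if raw is not None else None
```

Calling `load_consolidated` once at open time and passing the result to every `read_metadata` call keeps the unconsolidated path untouched: when .zmetadata is missing or malformed, `consolidated` is simply None and each key is fetched on its own.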

mannreis avatar Aug 27 '24 07:08 mannreis

The way I planned to do the consolidated metadata (aside: would like a shorter term than "consolidated") for netcdf is to create another dispatch layer for accessing the various metadata pieces. So for v2, this would wrap read/write of .zgroup, .zarray, and .zattrs. For v3, this would wrap access to zarr.json.
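To make the v2/v3 difference concrete, the following Python sketch (not netcdf-c code) lists which metadata objects such a layer would have to touch for one node. The function name and the `kind` parameter are hypothetical; only the object names (.zgroup, .zarray, .zattrs, zarr.json) come from the Zarr formats themselves.

```python
from typing import List


def metadata_keys(zarr_format: int, path: str, kind: str) -> List[str]:
    """Keys holding the metadata of the node at `path` ('' is the root group).

    `kind` is 'group' or 'array'.
    """
    if kind not in ("group", "array"):
        raise ValueError("kind must be 'group' or 'array'")
    prefix = path.strip("/")
    prefix = prefix + "/" if prefix else ""
    if zarr_format == 2:
        # v2 splits node metadata and attributes into separate objects.
        node = ".zgroup" if kind == "group" else ".zarray"
        return [prefix + node, prefix + ".zattrs"]
    if zarr_format == 3:
        # v3 keeps everything for a node in a single zarr.json.
        return [prefix + "zarr.json"]
    raise ValueError("unsupported Zarr format: %r" % (zarr_format,))
```

For example, `metadata_keys(2, "temp", "array")` yields the two v2 objects for an array named temp, while `metadata_keys(3, "temp", "array")` yields the single zarr.json.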

DennisHeimbigner avatar Aug 27 '24 20:08 DennisHeimbigner

How about using csd as a shorthand for consolidated (maybe even make both variants legal options)?

Personally, I would prefer to make consolidated the default, and fall back to unconsolidated, if no .zmetadata file is available (or the user explicitly asks for unconsolidated), but I would also understand if you prefer not to change existing behavior of libnetcdf...

florianziemen avatar Oct 22 '24 17:10 florianziemen

It occurs to me to ask: why is the consolidated metadata in a separate .zmetadata rather than in the root group's zarr.json?

DennisHeimbigner avatar Oct 22 '24 17:10 DennisHeimbigner

No idea why, but it is handled that way in zarr python for zarr2 ...

See your question here: https://github.com/zarr-developers/zarr-python/issues/720

florianziemen avatar Oct 22 '24 17:10 florianziemen

> The way I planned to do the consolidated metadata (aside: would like a shorter term than "consolidated") for netcdf is to create another dispatch layer for accessing the various metadata pieces. So for v2, this would wrap read/write of .zgroup, .zarray, and .zattrs. For v3, this would wrap access to zarr.json.

You mean adding a block of function pointers to [NC_Dispatch](https://github.com/Unidata/netcdf-c/blob/main/include/netcdf_dispatch.h.in#L34) that would handle the metadata(-file) operations for zarr? I was picturing something internal to the NCZ_* layer, but I don't have a really good overview of the code design.

mannreis avatar Oct 29 '24 14:10 mannreis

> You mean adding a block of function pointers to NC_Dispatch

No, I was thinking of an internal dispatch table. When I added support for Zarr version 3, I created a dispatch table discriminated on the version. I then constructed some code to look at the URL and the Zarr dataset to infer which version to use. I would do the same for the metadata dispatcher but discriminating on consolidated or not.
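A language-neutral sketch of that idea, written in Python rather than C and not taken from any netcdf-c branch: one table of functions per metadata flavour, plus a small routine that inspects the dataset to pick a table. The store is modelled as a plain dict of key to bytes, and every name here (`MetadataDispatch`, `infer_dispatch`, the `mode` values) is hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Store = Dict[str, bytes]  # stand-in for the real key/value storage layer


@dataclass(frozen=True)
class MetadataDispatch:
    """Internal table of metadata operations (only `read` is shown)."""
    name: str
    read: Callable[[Store, str], Optional[dict]]


def _read_plain(store: Store, key: str) -> Optional[dict]:
    # One request per metadata object.
    raw = store.get(key)
    return json.loads(raw) if raw is not None else None


def _read_consolidated(store: Store, key: str) -> Optional[dict]:
    # Every lookup is served from the single .zmetadata object.
    doc = json.loads(store[".zmetadata"])
    return doc["metadata"].get(key)


PLAIN = MetadataDispatch("plain", _read_plain)
CONSOLIDATED = MetadataDispatch("consolidated", _read_consolidated)


def infer_dispatch(store: Store, mode: Optional[str] = None) -> MetadataDispatch:
    """Pick a table: honour an explicit 'plain', otherwise look at the dataset."""
    if mode == "plain":
        return PLAIN
    return CONSOLIDATED if ".zmetadata" in store else PLAIN
```

Callers then write `infer_dispatch(store).read(store, "temp/.zarray")` and never branch on consolidated versus unconsolidated themselves. A real table would also carry the write and list operations, and would cache the parsed .zmetadata instead of decoding it on every read.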

DennisHeimbigner avatar Oct 29 '24 15:10 DennisHeimbigner

I finally understood that you're referring to the implementation in the branches of your fork! I'm taking a look at zarrv3b.tmp. Is this the branch you envision merging? Just to clarify, what we'd like is that, when opening a consolidated dataset (without authentication), one could point to a "vanilla HTTP" server. This means that S3-specific HTTP requests like method=list-bucket-v2 would be avoided (when the data is consolidated) or delayed (when it is not). Is this a sensible requirement?
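To show why consolidated metadata can make listing requests unnecessary, here is a small Python sketch (again not netcdf-c code). It assumes the Zarr v2 consolidated layout, where the keys of the "metadata" map look like "group/var/.zarray", and derives the direct children of a group from those keys alone. The function name is hypothetical.

```python
from typing import Dict, List, Tuple


def list_children(entries: Dict[str, dict],
                  group: str = "") -> Tuple[List[str], List[str]]:
    """Return (subgroups, arrays) directly under `group`, using only the keys
    of the consolidated "metadata" map, so no bucket listing is needed."""
    prefix = group.strip("/")
    prefix = prefix + "/" if prefix else ""
    groups, arrays = set(), set()
    for key in entries:
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("/")
        if len(parts) != 2:
            continue  # the group's own metadata, or a deeper descendant
        name, leaf = parts
        if leaf == ".zgroup":
            groups.add(name)
        elif leaf == ".zarray":
            arrays.add(name)
    return sorted(groups), sorted(arrays)
```

With the hierarchy known up front, the remaining requests are plain GETs for keys whose names can be computed, which is what a server without S3 listing support can still answer.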

mannreis avatar Nov 11 '24 12:11 mannreis