
v2: Standardizing .zmetadata

Open DennisHeimbigner opened this issue 3 years ago • 11 comments

I want to begin a discussion about standardizing the .zmetadata format for consolidated metadata.

Suppose we have this Zarr container.

.zgroup -- of the root group
var1
    .zarray -- for var1
subgroup1
    .zgroup
    var2
        .zarray -- for var2
        .zattrs  -- for var2    

This structure needs to be encoded as JSON in the .zmetadata object. I can see two obvious encodings:

  1. nested encoding
{
".zgroup": {<contents of the .zgroup>},
"var1": {
    ".zarray": {<contents of .zarray>}
    },
"subgroup1": {
    ".zgroup": {<contents of the .zgroup>},
    "var2": {
        ".zarray": {<contents of .zarray>},
        ".zattrs": {<contents of .zattrs>}
        }
    }
}
  2. flat-key encoding
{
"/.zgroup": {<contents of the .zgroup>},
"/var1/.zarray": {<contents of .zarray>},
"/subgroup1/.zgroup": {<contents of the .zgroup>},
"/subgroup1/var2/.zarray": {<contents of .zarray>},
"/subgroup1/var2/.zattrs": {<contents of .zattrs>}
}
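
For illustration, the two encodings carry the same information, and one can be derived from the other mechanically. The sketch below is only an assumption about how that might look in Python: `flatten` is a hypothetical helper, not part of any Zarr library, and the metadata bodies are made-up placeholders. The leading "/" is passed in as the initial prefix, so dropping it is a one-argument change.

```python
# Names that hold metadata documents rather than child groups/arrays.
METADATA_NAMES = (".zgroup", ".zarray", ".zattrs")

def flatten(nested, prefix="/"):
    """Turn a nested metadata dict into a flat-key dict (hypothetical helper)."""
    flat = {}
    for name, value in nested.items():
        if name in METADATA_NAMES:
            # A metadata object: emit it under its full path.
            flat[prefix + name] = value
        else:
            # A child group or array: recurse with an extended prefix.
            flat.update(flatten(value, prefix + name + "/"))
    return flat

# Placeholder contents; real .zgroup/.zarray/.zattrs bodies would be larger.
nested = {
    ".zgroup": {"zarr_format": 2},
    "var1": {".zarray": {"shape": [10]}},
    "subgroup1": {
        ".zgroup": {"zarr_format": 2},
        "var2": {".zarray": {"shape": [5]}, ".zattrs": {"units": "m"}},
    },
}

flat = flatten(nested)
for key in sorted(flat):
    print(key)
```

Calling `flatten(nested, prefix="")` instead would produce keys without the leading "/".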

My observations:

  • The flat-key encoding should, as a rule, be slightly smaller than the nested encoding.
  • The nested encoding would be easier to process into internal data structures, but that depends on the implementation. It would be faster for netcdf-c, but might not be for zarr-python.
  • Note that I have prefixed each key with "/", but that is just my choice; a decision is needed about that.
  • The one example I have seen in the wild uses flat-key encoding.
  • The flat-key encoding has no entries for non-content-bearing objects. So, for example, there is no "/subgroup1" key nor a "/subgroup1/var2" key. This seems reasonable, since such keys would not add any useful information.
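
The size claim is easy to check by measurement. The sketch below serializes both encodings of the example container compactly and compares the lengths. The metadata bodies are empty placeholders, so the numbers only reflect the overhead of the keys and braces; with real contents, and depending on how deep the hierarchy is and how often path prefixes repeat, the comparison may come out differently.

```python
import json

# Same container in both encodings; bodies are empty placeholders.
nested = {
    ".zgroup": {},
    "var1": {".zarray": {}},
    "subgroup1": {
        ".zgroup": {},
        "var2": {".zarray": {}, ".zattrs": {}},
    },
}
flat = {
    "/.zgroup": {},
    "/var1/.zarray": {},
    "/subgroup1/.zgroup": {},
    "/subgroup1/var2/.zarray": {},
    "/subgroup1/var2/.zattrs": {},
}

def compact_size(obj):
    """Length in characters of the most compact JSON serialization."""
    return len(json.dumps(obj, separators=(",", ":")))

nested_size = compact_size(nested)
flat_size = compact_size(flat)
print("nested:", nested_size, "flat:", flat_size)
```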

DennisHeimbigner avatar May 08 '21 19:05 DennisHeimbigner

The current .zmetadata format written by Zarr-Python uses the "flat-key encoding" without a prefix, e.g., looking at one of my recent datasets:

{
    "metadata": {
        ".zattrs": ...
        ".zgroup": ...
        "lat/.zarray": ...
        "lat/.zattrs": ...
        "level/.zarray": ...
        "level/.zattrs": ...
        ...
        "z/.zarray": ...
        "z/.zattrs": ...
    },
    "zarr_consolidated_format": 1
}

I suspect this format was chosen in part because it's slightly more natural in Zarr-Python to look up flat-keys rather than nested keys. But I'm sure performance would be fine with nested keys, too. Inherently both seem fine to me.
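
As a rough sketch of why flat keys are convenient to look up, the snippet below parses a document shaped like the one above and resolves metadata with a single dictionary access. It uses only the standard `json` module and is not Zarr-Python's actual implementation; the `lookup` helper and the metadata bodies are assumptions made for illustration.

```python
import json

# A small document shaped like the example above; bodies are placeholders.
text = """
{
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "lat/.zarray": {"shape": [721]},
        "lat/.zattrs": {"units": "degrees_north"}
    },
    "zarr_consolidated_format": 1
}
"""

doc = json.loads(text)

# Refuse to guess at the layout of an unknown format version.
if doc.get("zarr_consolidated_format") != 1:
    raise ValueError("unsupported consolidated metadata format")

metadata = doc["metadata"]

def lookup(path, name):
    """Return the metadata object `name` for the node at `path`, or None."""
    key = f"{path}/{name}" if path else name  # the root has no prefix
    return metadata.get(key)

lat_array = lookup("lat", ".zarray")
root_group = lookup("", ".zgroup")
missing = lookup("level", ".zarray")
```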

I trust that performance for parsing either structure in netCDF-C would probably be OK, too? At least for "reasonable" size consolidated metadata? But even if performance would be similar, if it's significantly harder to work with non-nested metadata on some platforms that is definitely worth considering.

shoyer avatar May 09 '21 20:05 shoyer

The reason that a nested encoding is slightly preferable for netcdf is that we track groups as independent objects, so we have to parse flat-keys to get the group info out of them. BTW, should there be a requirement that the order of flat keys be sorted?
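
To make the parsing step concrete, here is a minimal sketch (in Python rather than C, and not netcdf-c's code) of rebuilding a nested tree from flat keys by splitting each key on "/". The `nest` helper is hypothetical, and the metadata bodies are placeholders. It accepts keys with or without a leading "/".

```python
def nest(flat):
    """Rebuild a nested dict from flat keys (hypothetical helper)."""
    root = {}
    for key, value in flat.items():
        parts = key.strip("/").split("/")
        node = root
        # Walk (and create as needed) the intermediate groups/arrays.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        # The last component is the metadata object name, e.g. ".zarray".
        node[parts[-1]] = value
    return root

flat = {
    "/.zgroup": {"zarr_format": 2},
    "/var1/.zarray": {"shape": [10]},
    "/subgroup1/.zgroup": {"zarr_format": 2},
    "/subgroup1/var2/.zarray": {"shape": [5]},
    "/subgroup1/var2/.zattrs": {"units": "m"},
}

tree = nest(flat)
# A node is a group if it carries a ".zgroup" entry.
is_group = ".zgroup" in tree["subgroup1"]
```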

DennisHeimbigner avatar May 09 '21 22:05 DennisHeimbigner

BTW, should there be a requirement that the order of flat keys be sorted?

JSON objects are unordered, so no, there should be no expectations about sorting.
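
A quick way to see this in practice: two documents that differ only in key order parse to equal objects. The sketch below uses Python's standard `json` module with placeholder contents. A writer that wants reproducible output can still sort keys when serializing, but readers should not rely on it.

```python
import json

# Same content, different key order.
a = json.loads(
    '{"zarr_consolidated_format": 1,'
    ' "metadata": {"z/.zarray": {}, "lat/.zarray": {}}}'
)
b = json.loads(
    '{"metadata": {"lat/.zarray": {}, "z/.zarray": {}},'
    ' "zarr_consolidated_format": 1}'
)

# Python dict equality ignores insertion order.
same = (a == b)

# Sorting keys on output gives a deterministic serialization.
canonical_a = json.dumps(a, sort_keys=True)
canonical_b = json.dumps(b, sort_keys=True)
```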

shoyer avatar May 09 '21 23:05 shoyer

@DennisHeimbigner: are you thinking about the structure for V2, V3 or both?

joshmoore avatar May 13 '21 08:05 joshmoore

I was just looking at the existing V2 .zmetadata. Has the issue been raised for V3 yet?

DennisHeimbigner avatar May 13 '21 18:05 DennisHeimbigner

I see from this comment: https://github.com/zarr-developers/zarr-specs/issues/41#issuecomment-835580731 that there appears to be another .zmetadata encoding in use.

DennisHeimbigner avatar May 13 '21 18:05 DennisHeimbigner

I see from this comment: #41 (comment) that there appears to be another .zmetadata encoding in use.

Can you clarify what you mean here? I think this is the same .zmetadata encoding, just with keys printed in a different order (but JSON is not order sensitive).

shoyer avatar May 13 '21 18:05 shoyer

The header info


    'zarr_consolidated_format': 1,
    'metadata': {

is added. So it needs standardization also.

DennisHeimbigner avatar May 13 '21 18:05 DennisHeimbigner

The header info


    'zarr_consolidated_format': 1,
    'metadata': {

is added. So it needs standardization also.

Right, that was also in my example above: https://github.com/zarr-developers/zarr-specs/issues/113#issuecomment-835879740

shoyer avatar May 13 '21 19:05 shoyer

Just pinging this discussion based on today's zarr call: we need to sort out how to handle consolidated metadata for V3. Presumably it will be cheaper to list metadata in V3 because of the separation of metadata and data in the tree. But we still need an answer for "unlistable stores". Perhaps this is covered in the spec, but I did not see it.

rabernat avatar Aug 25 '21 19:08 rabernat

Consolidated metadata for v3 is being discussed in #136. Marking this issue for the v2 discussion.

jstriebel avatar Nov 16 '22 16:11 jstriebel