ngff icon indicating copy to clipboard operation
ngff copied to clipboard

Proposal: don't mix arrays and groups

Open d-v-b opened this issue 2 years ago • 5 comments

In the current spec, the zarr group for an image contains arrays (the different scale levels of the image data) as well as (optionally) a group (the labels group). E.g.,

foo.zarr                    # group, contains groups and arrays
foo.zarr/s0                 # array, scale level 0
foo.zarr/labels             # group, contains multiscale groups
foo.zarr/labels/original    # multiscale group, contains arrays
foo.zarr/labels/original/s0 # array, scale level 0

Storing arrays and groups in the same level of hierarchy can complicate semantics. I would propose the following structure instead:

foo.zarr                    # group, contains groups
foo.zarr/image              # multiscale group, contains arrays
foo.zarr/image/s0           # array, scale level 0
foo.zarr/labels             # group, contains multiscale groups
foo.zarr/labels/original    # multiscale group, contains arrays
foo.zarr/labels/original/s0 # array, scale level 0

Under with this scheme, we can define a "multiscale group" as a collection of arrays with a specific group metadata scheme, and then re-use that definition.

I chose the name "image" arbitrarily, it could also be something like "raw" or "default", and maybe the name of this group could be reserved.

d-v-b avatar Mar 10 '22 21:03 d-v-b

Storing arrays and groups in the same level of hierarchy can complicate semantics.

In the case of labels, I can definitely see correcting the structure as it currently stands. There has been ongoing confusion as seen with the currently open "labels" issues that we could likely address at the same time.

I'm a bit more hesitant about saying generally that arrays and groups can't be at the same level. My preference would be to have metadata that clearly defines the semantics.

I chose the name "image" arbitrarily, it could also be something like "raw" or "default", and maybe the name of this group could be reserved.

Similarly, I would avoid trying to encode too much in the names because invariably there will be difficulties getting everything need in a single group or array name.

foo.zarr/labels/original/s0 # array, scale level 0

And maybe relatedly, inspecting a plate today, I did have the impulse to add col, row, field, ..., scale prefixes to the integer-named groups to make the layout clearer. This would be a convention though with the metadata still being definitive as in https://github.com/ome/ngff/issues/48#issuecomment-1063813352.

joshmoore avatar Mar 10 '22 23:03 joshmoore

I'm a bit more hesitant about saying generally that arrays and groups can't be at the same level. My preference would be to have metadata that clearly defines the semantics.

Agreed, but Zarr already provides a set of operational semantics for "list-like thing" (zarr group) and "value-like thing" (zarr array). Life would be easier if the OME data model had a clean mapping onto the zarr semantics, e.g. something like type definitions.

On one extreme (most permissive typing), you could define that all elements of an OME-Zarr have the recursive type leaf = Union[Zarr.Group[leaf], Zarr.Array], in which case the layout of a single level of hierarchy doesn't convey any information, and metadata / object names must be inspected to infer the OME-Zarr semantics. At another extreme, every OME-Zarr element could be strictly typed -- e.g., MultiscaleGroup = Zarr.Group[Zarr.Array], OMEZarrRoot = Zarr.Group[MultiscaleGroup, Zarr.Group[MultiscaleGroup]]. This was my suggestion in the above comment.

The current spec is somewhere in the middle: it looks something like MultiscaleGroup = Zarr.Group[Zarr.Array, Optional[Zarr.Group[MultiscaleGroup]], i.e. a multiscale group is a collection of zarr arrays and optionally contains another group that contains more multiscale groups. Subjectively, having MultiscaleGroup optionally function as a generic container feels like an unnecessary overload of the semantics of a MultiscaleGroup that will make parsing harder.

And re: naming groups with integers -- what's the motivation for this? zarr uses integers as the names of chunk directories in nested storage, and for chunks themselves, so it seems like integer names should probably be reserved for that purpose.

d-v-b avatar Mar 11 '22 15:03 d-v-b

Zarr already provides a set of operational semantics for "list-like thing" (zarr group) and "value-like thing" (zarr array).

Guess I think of zarr groups as being more dictionary like than list like which may explain some of our differing opinion.

`Union[Zarr.Group[leaf], Zarr.Array]

This is how I think of it in the general case.

every OME-Zarr element could be strictly typed -- e.g., MultiscaleGroup = Zarr.Group[Zarr.Array], OMEZarrRoot = Zarr.Group[MultiscaleGroup, Zarr.Group[MultiscaleGroup]]. This was my suggestion in the above comment.

Gotcha. If this issue is "don't mix the multiscale zgroup" then I could see adding that typing info to the spec, even if not necessarily doing it for all zgroups. (The top-level zgroup is of course a special case)

what's the motivation for this? zarr uses integers as the names of chunk directories in nested storage, and for chunks themselves, so it seems like integer names should probably be reserved for that purpose.

That's what I was getting at as well (though in Zarr v3 chunks themselves also get prefixes)

joshmoore avatar Mar 11 '22 17:03 joshmoore

Guess I think of zarr groups as being more dictionary like than list like which may explain some of our differing opinion.

No, you're right, groups are definitely more dict-like than list-like, that's just an oversight on my part.

Gotcha. If this issue is "don't mix the multiscale zgroup" then I could see adding that typing info to the spec, even if not necessarily doing it for all zgroups. (The top-level zgroup is of course a special case)

Right, I support this tighter framing of the issue. I wasn't considering cases of zarrays outside of a multiscale group -- those situations should be considered separately.

d-v-b avatar Mar 11 '22 17:03 d-v-b

I didn't have time to fully read this issue; I will try to have a look later. But I already wanted express my full support for this:

And re: naming groups with integers -- what's the motivation for this? zarr uses integers as the names of chunk directories in nested storage, and for chunks themselves, so it seems like integer names should probably be reserved for that purpose.

These integer named groups are super confusing. They should be avoided in the spec. (As a side-note: having the scale arrays with integer names is also not helping and calling them e.g. s0, s1, s2, ... is much better). I also think that it's not a good choice to use these things as defaults, e.g. in bioformats2raw: https://github.com/glencoesoftware/bioformats2raw/issues/109.

constantinpape avatar Mar 16 '22 11:03 constantinpape