Zarr N5 spec diff
Overview of the diff between the zarr and n5 specs, with the potential goal of consolidating the two formats. @alimanfoo, @jakirkham / @axtimwalde please correct me if I am misrepresenting the zarr / n5 spec or if you think there is something to add here. Note that the zarr and n5 specs use different naming conventions: the data containers are called arrays in zarr and datasets in n5. Zarr refers to the nested storage of data containers as hierarchies or groups ~~(it is not quite clear to me what the actual difference is, see below)~~, n5 only refers to groups. I will use the group / dataset notation.
Edit: some corrections from @alimanfoo; I left the original statements in but struck them out.
## Groups

- attributes:
  - zarr: groups MUST contain a json file `.zgroup`, which MUST contain `zarr_format` and MUST NOT contain any other keys. They CAN contain additional attributes in `.zattrs`.
  - n5: groups CAN contain a file `attributes.json` containing arbitrary json serializable attributes. The root group `"/"` MUST contain the key `n5` with the n5 version.
- ~~zarr makes a distinction between hierarchies and groups. I am not quite certain if there is a difference. The way I read the spec, having nested datasets is allowed, i.e. having a dataset that contains another dataset.~~ Zarr does not allow nested datasets (i.e. a dataset containing another dataset). This is not allowed in n5 either, I think. The spec does not explicitly forbid it though.
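For concreteness, a sketch of what the two group metadata files could look like (the n5 version string and the extra attribute are invented placeholders, not taken from either spec):

```python
import json

# zarr: every group has a ".zgroup" file with exactly this content
zgroup = json.dumps({"zarr_format": 2})

# n5: the root group's "attributes.json" carries the format version,
# alongside arbitrary user attributes (both values here are placeholders)
n5_root_attrs = json.dumps({"n5": "2.0.0", "author": "constantin"})

print(zgroup)
print(n5_root_attrs)
```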
## Datasets

- metadata:
  - zarr: metadata is stored in `.zarray`.
  - n5: metadata is stored in `attributes.json`.
- layout:
  - zarr: supports `C` (row-major) and `F` (column-major) indexing, which determines ~~how chunks are indexed and~~ how chunks are stored. This is determined via the key `order`. Chunks are always indexed as row-major.
  - n5: chunk indexing and storage is done according to column-major layout (`F`).
- dtype:
  - zarr: key `dtype` holds the numpy type encoding. Importantly, supports big- and little-endian, which MUST be specified.
  - n5: key `dataType`, only numerical types and only big endian.
- compression:
  - zarr: supports all numcodecs compressors (and no compression), stored in key `compressor`.
  - n5: by default supports `raw` (= no compression), `bzip2`, `gzip`, `lz4` and `xz`. There is a mechanism to support additional compressors. Stored in key `compression`.
- filters:
  - zarr: supports additional filters from numcodecs that can be applied to chunks before (de)serialization. Stored in key `filters`.
  - n5: does not support this. However, the mechanism for additional compressors could be hijacked to achieve something similar.
- fill-value:
  - zarr: the fill-value determines how chunks that don't exist are initialised. Stored in key `fill_value`.
  - n5: fill-value is hard-coded to 0 (and hence not part of the spec).
- attributes:
  - zarr: additional attributes can be stored in `.zattrs`.
  - n5: additional attributes can be stored in `attributes.json`. MUST NOT override keys reserved for metadata.

In addition, zarr and n5 store the shape of the dataset and of the chunks in the metadata with the keys `shape`, `chunks` / `dimensions`, `blockSize`.
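To summarize, an illustrative sketch of the two metadata files for the same hypothetical 2D `uint16` dataset (the compressor / compression objects are simplified sketches, not exact spec output):

```python
import json

# zarr: stored in ".zarray"
zarr_array_meta = {
    "zarr_format": 2,
    "shape": [100, 100],
    "chunks": [30, 30],
    "dtype": ">u2",            # big-endian uint16, endianness is explicit
    "order": "C",              # row-major storage within a chunk
    "fill_value": 0,
    "compressor": {"id": "gzip", "level": 5},
    "filters": None,
}

# n5: stored in "attributes.json"
n5_dataset_meta = {
    "dimensions": [100, 100],
    "blockSize": [30, 30],
    "dataType": "uint16",      # always big endian, so no byte-order flag
    "compression": {"type": "gzip", "level": 5},
    # no "order" (always F), no "fill_value" (always 0), no filters
}

print(json.dumps(zarr_array_meta, indent=2))
print(json.dumps(n5_dataset_meta, indent=2))
```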
## Chunk storage

1. header:
   - zarr: chunks are stored without a header.
   - n5: chunks are stored with a header that encodes the chunk's mode (see 3.) and the shape of the chunk.
2. shape of edge chunks:
   - zarr: chunks are always stored with the full chunk shape, also if they are over-hanging (e.g. chunk shape `(30, 30)` and dataset shape `(100, 100)`).
   - n5: only the valid part of a chunk is stored. This is possible due to 1.
3. varlength chunks:
   - zarr: as far as I know not supported.
   - n5: supports var-length mode (specified in the header). In this case, the size of the chunk is not determined by the chunk's shape, but is additionally defined in the header. This is useful for ND storage of "less structured" data, e.g. a histogram of the values in the ROI corresponding to the chunk.
4. indexing / storage:
   - zarr: chunks are indexed by `.` separated keys, e.g. `2.4`. ~~I think somewhere @alimanfoo mentioned that zarr also supports nested chunks, but I can't find this in the spec.~~ These keys get mapped to a representation appropriate for the implementation. E.g. on the filesystem, keys can be trivially mapped to files called `2.4` or nested as `2/4`.
   - n5: chunks are stored nested, e.g. `2/4`. (This is also implementation dependent. There are implementations where nesting might not make sense. The difference is only `.` separated vs. `/` separated.)
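The key-mapping and edge-chunk differences can be made concrete with a small sketch (the helper names are mine, not from either spec):

```python
def zarr_chunk_key(grid_index):
    """zarr: '.'-separated key, e.g. (2, 4) -> '2.4'."""
    return ".".join(str(i) for i in grid_index)

def n5_chunk_path(grid_index):
    """n5: nested '/'-separated path, e.g. (2, 4) -> '2/4'."""
    return "/".join(str(i) for i in grid_index)

def stored_chunk_shape(grid_index, shape, chunks, clip):
    """Shape actually stored for the chunk at grid_index.

    clip=False: zarr behaviour, always the full chunk shape.
    clip=True:  n5 behaviour, edge chunks are clipped to the valid region.
    """
    if not clip:
        return tuple(chunks)
    return tuple(
        min(c, s - i * c) for i, s, c in zip(grid_index, shape, chunks)
    )

# dataset shape (100, 100), chunk shape (30, 30) -> 4 x 4 chunk grid
print(zarr_chunk_key((2, 4)))                                        # 2.4
print(n5_chunk_path((2, 4)))                                         # 2/4
print(stored_chunk_shape((3, 3), (100, 100), (30, 30), clip=False))  # (30, 30)
print(stored_chunk_shape((3, 3), (100, 100), (30, 30), clip=True))   # (10, 10)
```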
> - zarr makes a distinction between hierarchies and groups. I am not quite certain if there is a difference. The way I read the spec, having nested datasets is allowed, i.e. having a dataset that contains another dataset. This is not allowed in n5, I think. The spec does not explicitly forbid it though.
Sorry for any confusion, nesting datasets inside datasets is not allowed in zarr. I.e., you can put a group inside another group, or you can put a dataset inside a group. The word "hierarchy" in the zarr spec is used to mean a tree of groups and datasets, starting from some root group.
> - zarr: supports `C` and `F` indexing, which determines how chunks are indexed and how chunks are stored. This is determined via the key `order`.
In zarr, the "C" or "F" order refers to the ordering of items within a chunk.
Not completely sure what you mean by "indexing" here. E.g., do you mean how we refer to a specific chunk within the grid of chunks for a given array? If so, the indexing of chunks within the chunk grid is only ever done in zarr in row-major order. E.g., for a 2D array of shape (100, 100) and chunk shape (10, 10), chunk "0.1" always means the chunk with rows 0-9 and columns 10-19.
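That row-major chunk-grid indexing can be sketched as follows (the helper name is invented for illustration):

```python
def chunk_key_to_slices(key, chunks):
    """Map a zarr chunk key like '0.1' to the region it covers.

    For chunk shape (10, 10), key "0.1" addresses rows 0-9 and
    columns 10-19, regardless of the in-chunk "order" setting.
    """
    grid_index = [int(i) for i in key.split(".")]
    return tuple(
        slice(i * c, (i + 1) * c) for i, c in zip(grid_index, chunks)
    )

print(chunk_key_to_slices("0.1", (10, 10)))
# (slice(0, 10, None), slice(10, 20, None))
```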
> Sorry for any confusion, nesting datasets inside datasets is not allowed in zarr. I.e., you can put a group inside another group, or you can put a dataset inside a group. The word "hierarchy" in the zarr spec is used to mean a tree of groups and datasets, starting from some root group.
Thanks for clarifying, that makes total sense.
I find the part on groups and hierarchies in the spec a bit confusing, maybe there is room to improve it.
That would be a separate issue though.
> E.g., do you mean how we refer to a specific chunk within the grid of chunks for a given array? If so, the indexing of chunks within the chunk grid is only ever done in zarr in row-major order. E.g., for a 2D array of shape (100, 100) and chunk shape (10, 10), chunk "0.1" always means the chunk with rows 0-9 and columns 10-19.
Yes, that's what I meant. Good to know.
> - zarr: chunks are stored flat by `.` separated keys, e.g. `2.4`. I think somewhere @alimanfoo mentioned that zarr also supports nested chunks, but I can't find this in the spec.
In zarr the chunk keys are always formed by separating chunk indices with a period, e.g. "2.4". However, the storage layer can make choices about how it maps keys down to underlying storage features. E.g., the default file-system storage class in zarr python (DirectoryStore) does the obvious thing of mapping keys to file paths without any transformation, so you will get a file called "2.4". But there is an alternative implementation of file-system storage (NestedDirectoryStore) which applies a transformation on the chunk keys to get to file paths, so you get file paths like "2/4".
This is an example of how in the zarr storage spec there is a separation between the store interface, which is assumed to be an abstract key-value interface and does not make any assumptions about how that will get implemented in terms of files or objects or memory or whatever; and the underlying storage implementation, which makes concrete decisions about what files to create (if using a file system) and what each file should contain.
The zarr storage spec does not place any constraint on the storage implementation, as long as you can provide a key-value interface over it then any form of storage is allowed. E.g., storing data inside an sqlite3, bdb or lmdb database, or zip file, are all valid ways of storing zarr data on a file system.
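A minimal sketch of such an abstract key-value store, assuming nothing beyond Python's mapping protocol (the class and keys are illustrative, not the zarr API):

```python
from collections.abc import MutableMapping

class MemoryStore(MutableMapping):
    """Toy store: metadata/chunk keys map to raw bytes in memory.

    Any object with this interface could back a zarr hierarchy; files,
    objects, a database, or a zip archive are equally valid backings.
    """

    def __init__(self):
        self._d = {}

    def __getitem__(self, key):
        return self._d[key]

    def __setitem__(self, key, value):
        self._d[key] = value

    def __delitem__(self, key):
        del self._d[key]

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

store = MemoryStore()
store["my_array/.zarray"] = b"{...metadata...}"   # placeholder bytes
store["my_array/2.4"] = b"\x00" * 16              # some chunk bytes
print(sorted(store))
```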
That said I fully take the point that several people have made that it would be useful to also have some concrete storage implementations documented, either within the zarr storage spec or in some associated specs, so that e.g. anyone who wants to implement a specific file format can do so more easily.
Many thanks @constantinpape, great summary. Hopefully comments have clarified a few points, but very happy to expand on any areas.
> In zarr the chunk keys are always formed by separating chunk indices with a period, e.g. "2.4". However, the storage layer can make choices about how it maps keys down to underlying storage features.
Ok, this makes perfect sense. There might be implementations where nested does not have any meaning.
This means the difference to n5 is rather cosmetic, i.e. `/` separated vs. `.` separated. (And also `C` vs `F` order.)
> That said I fully take the point that several people have made that it would be useful to also have some concrete storage implementations documented, either within the zarr storage spec or in some associated specs
Yes, that would be very helpful indeed.
Thanks for clarifying @alimanfoo. I have edited the text accordingly.
Along these lines PR ( https://github.com/zarr-developers/zarr/pull/309 ) might be of interest. This effectively remaps keys to allow access of N5 content from within the Python Zarr library.
Thanks for pointing this out @jakirkham. Will be very useful to read n5 from the zarr main library. Though afaik the n5 varlen mode is not supported yet. Maybe @funkey could clarify.
I think that in general consolidating the specs would be of great use nevertheless. It would reduce the double development effort and new language implementations would be able to read both formats by default.
This is great, thank you!
> I think that in general consolidating the specs would be of great use nevertheless. It would reduce the double development effort and new language implementations would be able to read both formats by default.
:+1:
For the items you listed under Groups and Datasets, @constantinpape, I get the feeling that n5 could be characterized as "minimum" whereas zarr might be "standard", if that's a useful relationship to have between specs.
The chunk differences seem to reverse that though....
> For the items you listed under `Groups` and `Datasets`, @constantinpape, I get the feeling that n5 could be characterized as "minimum" whereas zarr might be "standard", if that's a useful relationship to have between specs.
>
> The chunk differences seem to reverse that though....
Yes that captures it pretty well. For Groups the differences are minimal though.
For Datasets zarr is more elaborate, as it allows for more datatypes, e.g. unicode and structured datatypes (= tuples of datatypes), and also supports little and big endian as well as `C` and `F` order, and has native support for filters other than compression.
For chunks n5 is more expressive, as it supports clipped edge chunks and varlength mode by means of the header data.
Note that zarr supports something similar to the varlength chunks as well with datatype `O`, where encoding and decoding are achieved through a filter, see #6.
To me the n5 approach seems more portable though, because chunks can be decoded without the need for an extra filter.
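For reference, as I read the n5 spec, the default-mode chunk header is a big-endian `uint16` mode, a `uint16` number of dimensions, and one `uint32` per dimension for the stored (possibly clipped) chunk size; var-length mode additionally stores an element count. A sketch using `struct` (helper names are mine):

```python
import struct

def pack_n5_header(chunk_shape, mode=0):
    """Pack a default-mode (mode 0) n5 chunk header, big endian."""
    ndim = len(chunk_shape)
    return struct.pack(f">HH{ndim}I", mode, ndim, *chunk_shape)

def unpack_n5_header(data):
    """Read mode and stored chunk shape back out of a header."""
    mode, ndim = struct.unpack_from(">HH", data, 0)
    shape = struct.unpack_from(f">{ndim}I", data, 4)
    return mode, shape

header = pack_n5_header((10, 10))  # a clipped edge chunk
print(unpack_n5_header(header))    # (0, (10, 10))
```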