zarr-specs
v3: chunk_memory_layout could be specified as an explicit order rather than C or F
The current specification allows "C" and "F". But in some cases the optimal memory layout may not match the most natural dimension order, e.g. you might want the dimensions to be "zyxc", but the memory layout to be C order relative to the dimension order of czyx. To address that, instead of using "C" and "F", the memory order can be specified as an explicit list of dimensions, e.g. [0, 1, 2] for C order and [2, 1, 0] for Fortran order (assuming 3 dimensions). NumPy supports arbitrary dimension orders just fine.
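As a rough illustration of the NumPy point, here is a minimal sketch of allocating an array whose memory layout follows an explicit dimension list. The helper name `empty_with_order` is hypothetical, and the convention assumed here (dimensions listed from slowest-varying to fastest-varying) is the one implied by the [0, 1, 2] / [2, 1, 0] examples above.

```python
import numpy as np

def empty_with_order(shape, order, dtype=np.uint8):
    """Allocate an array whose memory layout follows `order`.

    Assumption: `order` lists dimensions from slowest-varying to
    fastest-varying, so [0, 1, 2] is C order and [2, 1, 0] is
    Fortran order for 3 dimensions.
    """
    order = list(order)
    # Allocate a C-contiguous buffer with the dimensions arranged in storage order.
    stored = np.empty([shape[d] for d in order], dtype=dtype)
    # Permute the axes back so the caller sees the logical dimension order.
    inverse = tuple(np.argsort(order).tolist())
    return stored.transpose(inverse)

c_like = empty_with_order((2, 3, 4), [0, 1, 2])
f_like = empty_with_order((2, 3, 4), [2, 1, 0])
mixed = empty_with_order((2, 3, 4), [1, 2, 0])
```

All three arrays have shape (2, 3, 4); only their strides differ, which is the sense in which the layout is independent of the dimension order.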
Note: This is the representation used by TensorStore: https://google.github.io/tensorstore/schema.html#json-ChunkLayout.inner_order
@jbms: did you also see the proposal in https://github.com/zarr-developers/zarr-specs/issues/126 to remove order?
But in some cases the optimal memory layout may not match the most natural dimension order
Thanks for bringing this up. Can you link to some use cases or other documentation on this? In particular, I am not sure I understand what "natural" dimension order would mean, i.e. why should I present the data to the user in a different order than how it is stored?
Here is one example:
Suppose we are storing volumetric data indexed by x, y, z. It is natural to order the dimensions [x, y, z], or sometimes [z, y, x] if we want to use C order. But suppose we will be processing [x, z] cross sections of the data, and therefore want the data to be stored as Fortran order relative to [x, z, y] for efficient access. For consistency, though, it may still be desired for the dimension order to be [x, y, z].
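To make the cross-section example concrete, here is a small NumPy sketch. The sizes are made up, and the storage order is an assumption derived from the description above: Fortran order relative to [x, z, y] means x varies fastest and y slowest, i.e. [y, z, x] from slowest to fastest.

```python
import numpy as np

nx, ny, nz = 4, 3, 5  # illustrative sizes only

# Assumed storage order, slowest to fastest: y, z, x
# (i.e. Fortran order relative to [x, z, y]).
stored = np.arange(ny * nz * nx, dtype=np.int32).reshape(ny, nz, nx)

# Present the data with the logical dimension order [x, y, z].
volume = stored.transpose(2, 0, 1)

# One [x, z] cross section at y = 1: a contiguous block of the buffer.
xz = volume[:, 1, :]

# For comparison, the same slice of a plain C-order [x, y, z] array is strided.
plain = np.zeros((nx, ny, nz), dtype=np.int32)
xz_plain = plain[:, 1, :]
```

With this layout `xz` is a single contiguous (Fortran-ordered) block, while `xz_plain` has to gather elements spread across the whole buffer.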
In general I see zarr as already an abstraction layer --- the data isn't actually stored in C order or Fortran order --- it is stored chunked and compressed, and it is only the intermediate uncompressed chunk representation that is in C order or Fortran order. If you use an image codec with zarr (see e.g. the imagecodecs Python package), this uncompressed C or Fortran order representation may not be relevant at all.
A better use case for this feature came up this evening: t5x (https://github.com/google-research/t5x) uses tensorstore to store machine learning model checkpoints. A user had modified the model to transpose the first two parameters of some variables, but wanted to load an existing checkpoint. This was possible without actually modifying the checkpoint or adding any special code to transpose when loading the model, by just modifying the tensorstore specs stored as part of the checkpoint to perform a transpose via an "index transform" (https://google.github.io/tensorstore/index_space.html#json-IndexTransform). However, it would be nice if this could be accomplished purely with zarr just by modifying the metadata file.
If we allow an arbitrary permutation as the chunk_memory_layout, and furthermore use the same order to generate the chunk keys, then we can transpose the dimensions of an array purely by modifying the metadata.
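A minimal sketch of that idea, for a single uncompressed chunk only. The metadata here is a hypothetical, pared-down dict rather than the real zarr metadata document; `decode_chunk` and `transpose_metadata` are made-up helper names; and chunk_memory_layout is assumed to be a permutation listing dimensions from slowest-varying to fastest-varying. Chunk keys are ignored here; as noted above, they would need to follow the same order.

```python
import numpy as np

def decode_chunk(buf, meta):
    """Interpret raw chunk bytes according to the (hypothetical) metadata."""
    shape = meta["shape"]
    order = meta["chunk_memory_layout"]
    # The bytes are laid out with the dimensions in storage order.
    stored = np.frombuffer(buf, dtype=meta["dtype"]).reshape(
        [shape[d] for d in order]
    )
    # Permute back to the logical dimension order.
    return stored.transpose(tuple(np.argsort(order).tolist()))

def transpose_metadata(meta, perm):
    """Return metadata describing the same bytes with dimensions permuted.

    New dimension j corresponds to old dimension perm[j].
    """
    inverse = np.argsort(perm).tolist()
    return {
        "dtype": meta["dtype"],
        "shape": [meta["shape"][d] for d in perm],
        # Relabel each stored dimension with its new index.
        "chunk_memory_layout": [inverse[d] for d in meta["chunk_memory_layout"]],
    }

arr = np.arange(24, dtype="<i4").reshape(2, 3, 4)
buf = arr.tobytes()  # C-order bytes, never rewritten below
meta = {"dtype": "<i4", "shape": [2, 3, 4], "chunk_memory_layout": [0, 1, 2]}

# Swap the first two dimensions by editing the metadata only.
swapped_meta = transpose_metadata(meta, [1, 0, 2])
swapped = decode_chunk(buf, swapped_meta)
```

Here `swapped` equals `arr.transpose(1, 0, 2)` even though `buf` is untouched.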
IMO it's a benefit to know the underlying data layout easily to be able to reason about efficiency when traversing and indexing an array. C and Fortran order are well-known concepts, whereas an arbitrary order is rather unusual. I'd argue that re-ordering the dimensions might still be allowed in an implementation, but this would not necessarily affect the metadata, similar to numpy.moveaxis not changing the underlying array, just providing a different view of the data.
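For reference, a short sketch of the numpy.moveaxis behaviour being described (the array sizes are arbitrary):

```python
import numpy as np

a = np.zeros((2, 3, 4))
b = np.moveaxis(a, 0, -1)  # move the first axis to the end

# b is a view: same buffer, different shape and strides.
same_buffer = np.shares_memory(a, b)

# Writing through the view is visible in the original array.
b[0, 0, 1] = 7.0
```

After this, `a[1, 0, 0]` holds 7.0, and `a` itself is still C-contiguous with its original shape.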
If this seems useful, I'd rather make it an extension than a core feature of zarr. Do you agree, @jbms?
PS: Especially with #162 it's possible for clients to re-order the axes as needed and not rely on an expected order.