cubed
Use Zarr groups for outputs with multiple fields
We currently use a single Zarr array with a structured data type for storing intermediate outputs with multiple fields (such as the total and count when computing the mean).
Structured types can be inefficient to store because they are row-based, so they can't take advantage of Zarr's columnar encoding. On the other hand, they allow multiple fields to be read or written in a single IO operation, although we don't currently make use of this, at least on the write side.
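To make the row-based versus columnar distinction concrete, here is a minimal standard-library sketch. It does not use Zarr or cubed: the `totals` and `counts` values are made up, and `zlib` stands in for whatever compressor a real store would use. It packs the same data once as interleaved `(total, count)` records and once as two separate buffers, then prints the compressed sizes.

```python
import struct
import zlib

# Hypothetical intermediate values for a mean: one running total and one
# count per element. These numbers are invented for illustration.
totals = [i * 0.5 for i in range(1000)]
counts = [i % 7 + 1 for i in range(1000)]

# Row-based (structured) layout: each record is (total, count), interleaved.
# "<dq" is a little-endian float64 followed by an int64, with no padding.
row_bytes = b"".join(struct.pack("<dq", t, c) for t, c in zip(totals, counts))

# Columnar layout: one buffer per field, as separate arrays would have.
total_bytes = struct.pack(f"<{len(totals)}d", *totals)
count_bytes = struct.pack(f"<{len(counts)}q", *counts)

# Compress each layout; zlib is only a stand-in for a real codec.
row_size = len(zlib.compress(row_bytes))
col_size = len(zlib.compress(total_bytes)) + len(zlib.compress(count_bytes))
print(f"row-based: {row_size} bytes, columnar: {col_size} bytes")
```

The sizes depend heavily on the data and the codec, so this only shows how to compare the two layouts; it is not a benchmark of either approach.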
The obvious alternative is to store each field in a separate array within a Zarr group. This would improve storage efficiency, and the extra IO overhead could be mitigated by using asyncio to read and write the arrays in parallel. The asyncio approach would need Zarr v3, and the v3 core spec does not include structured data types either.
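The idea of overlapping the per-field IO can be sketched with the standard library alone. This is not the Zarr v3 API: a temporary directory stands in for a group, one file per field stands in for an array, and `write_field`, `read_field`, `write_fields`, `read_fields` and `roundtrip` are hypothetical helpers invented for this example.

```python
import asyncio
import struct
import tempfile
from pathlib import Path


def write_field(path: Path, values: list[float]) -> None:
    # Blocking write of one field to its own file (stand-in for one array).
    path.write_bytes(struct.pack(f"<{len(values)}d", *values))


def read_field(path: Path) -> list[float]:
    # Blocking read of one field; each value is an 8-byte float64.
    data = path.read_bytes()
    return list(struct.unpack(f"<{len(data) // 8}d", data))


async def write_fields(group_dir: Path, fields: dict[str, list[float]]) -> None:
    # Run the blocking writes in worker threads so they can overlap.
    await asyncio.gather(
        *(asyncio.to_thread(write_field, group_dir / name, values)
          for name, values in fields.items())
    )


async def read_fields(group_dir: Path, names: list[str]) -> dict[str, list[float]]:
    # Read all fields concurrently and reassemble them by name.
    results = await asyncio.gather(
        *(asyncio.to_thread(read_field, group_dir / name) for name in names)
    )
    return dict(zip(names, results))


def roundtrip(fields: dict[str, list[float]]) -> dict[str, list[float]]:
    # The temporary directory plays the role of the group.
    with tempfile.TemporaryDirectory() as tmp:
        group_dir = Path(tmp)
        asyncio.run(write_fields(group_dir, fields))
        return asyncio.run(read_fields(group_dir, list(fields)))


fields = {"total": [1.5, 4.0, 9.5], "count": [1.0, 2.0, 3.0]}
result = roundtrip(fields)
print(result == fields)
```

Whether concurrent access actually recovers the cost of the extra IO operations depends on the store and its latency, which is what benchmarking would need to establish.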
So this should definitely be a Zarr v3 feature, but it may be useful for v2 as well (if benchmarking shows that it doesn't perform worse).
See also #197