xarray
Consider changing default `consolidated=None` to `False` for zarr_version=3 in `to_zarr()`
What is your issue?
Currently, in `Dataset.to_zarr()`, the `consolidated` parameter defaults to `None`, which means that xarray attempts to consolidate metadata by default. However, when using `zarr_version=3`, consolidated metadata is not required and might create issues. See this discussion
Would it make sense to change the default to `False` when `zarr_version=3` is set, given that `None` currently implies metadata consolidation?
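To make the question concrete, here is a minimal sketch of the proposed default rule. The helper name `resolve_consolidated` is hypothetical and is not part of xarray; it only models how `consolidated=None` could be interpreted differently depending on `zarr_version`.

```python
from typing import Optional


def resolve_consolidated(consolidated: Optional[bool], zarr_version: int) -> bool:
    """Hypothetical helper: decide whether to write consolidated metadata."""
    if consolidated is not None:
        # An explicit True/False from the caller always wins.
        return consolidated
    # Proposed rule: with no explicit choice, consolidate only for Zarr v2.
    return zarr_version == 2
```

Under this sketch, callers who pass `consolidated=True` explicitly would still get consolidated metadata with `zarr_version=3`; only the implicit default changes.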
Interesting that @rabernat sounds less keen on consolidated metadata at that link
From afar, I had thought that consolidated metadata was seen as good! In particular, that zarr stores with lots of files were much faster with this enabled, particularly on blob or network-attached storage
I had also thought that appending to a zarr from xarray updated the consolidated metadata. I guess that's wrong?
Coincidentally, @jhamman and I were just talking about how we feel that consolidated metadata should no longer be the default in general, not just for zarr 3! The logic is:
- Consolidated metadata never made it into the zarr spec, not even the v3 spec, so our current default writing behaviour immediately encourages you to go off spec and our current default reading behaviour will chastise you for doing something spec-compliant.
- It is only of use for certain stores (i.e. helping with latency of cloud object stores and helping with traversability of http stores) but our default behaviour is to write consolidated metadata even for stores where it has no use, e.g. `MemoryStore` and local directory store.
- The main argument for it is to reduce latency when opening deeply nested cloud object stores, but recent improvements to zarr-python by @d-v-b now mean this is a lot faster even without consolidated metadata, lessening the need for it.
We propose changing consolidated metadata to be opt-in (both write and read), following a deprecation cycle.
Awesome blog post! Thanks a lot. +1 from me, AFAIU
I am not convinced that stopping Xarray's use of consolidated metadata by default would be a service to our users. My main concern is that there are no good alternatives that achieve comparable performance. In the long term, I think something like Icechunk might solve this problem, but for now, even with @d-v-b's very impressive speed-ups, it is still too slow to open cloud-based Zarr stores without consolidated metadata. For example, when I try the example in Earthmover's blog post in Google Colab, it takes 12 seconds with `consolidated=False` vs 900 ms with `consolidated=True`.
Two changes that I do think would make sense to improve user experience with consolidated metadata:
- We could update the default heuristics to only use consolidated metadata for stores where it is needed, e.g., for stores that access data from remote object stores.
- Xarray could ensure that appending to an Xarray store with `mode='a'` updates consolidated metadata by default. This would avoid the consistency issues discussed in https://github.com/zarr-developers/zarr-python/issues/2830.
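As a rough illustration of the first suggestion, the default could be derived from where the store lives. This is only a sketch: the function name `default_consolidated` and the list of URL schemes are assumptions made for the example, not existing xarray or zarr-python behaviour.

```python
from urllib.parse import urlparse

# Assumed set of URL schemes treated as remote; illustrative, not exhaustive.
REMOTE_SCHEMES = {"s3", "gs", "gcs", "az", "abfs", "http", "https"}


def default_consolidated(store_url: str) -> bool:
    """Hypothetical heuristic: consolidate by default only for remote stores."""
    # Plain local paths have an empty scheme, so they fall through to False.
    scheme = urlparse(store_url).scheme.lower()
    return scheme in REMOTE_SCHEMES
```

For example, `default_consolidated("s3://bucket/data.zarr")` would opt in, while a local path such as `/tmp/data.zarr` would not. A real implementation would need to inspect the store object rather than a URL string.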
To respond to Tom's specific points:
> our current default writing behaviour immediately encourages you to go off spec and our current default reading behaviour will chastise you for doing something spec-compliant.
The other way to fix this would be to stop chastising users :). Yes, consolidated metadata is off spec, but unlike other Zarr extensions it doesn't result in creating data that isn't readable by other Zarr implementations. It only runs the risk of being potentially inconsistent.
> It is only of use for certain stores (i.e. helping with latency of cloud object stores and helping with traversability of http stores) but our default behaviour is to write consolidated metadata even for stores where it has no use, e.g. MemoryStore and local directory store.
This is a fair concern, but such stores are rarely used in my experience. Distributed filesystems & cloud stores are the norm with Zarr.
> The main argument for it is to reduce latency when opening deeply nested cloud object stores, but recent improvements to zarr-python by @d-v-b now mean this is a lot faster even without consolidated metadata, lessening the need for it.
Yes, this is a significant improvement, but as noted above, opening large Zarr stores without consolidated metadata is still too slow (12 vs 1 second).
Thanks for the input @shoyer, those are all great points.
After further discussion with @jhamman I've made an alternative suggestion for how to remove this annoyance through upstream changes in zarr-python instead - see https://github.com/zarr-developers/zarr-python/issues/2937.
tl;dr: Whether or not one should be using consolidated metadata is fundamentally a property of the store, so there should be an actual property on zarr.Store which expresses this preference for each implementation. Then xarray can just quietly look at that instead of making users think about it.
I think that should allow us to keep reading/writing consolidated metadata by default for stores that benefit from it, whilst not using it for stores which don't, without the user having to understand and specify which is which.
https://github.com/zarr-developers/zarr-python/pull/3119 is starting on this. My preference matches Tom's proposal in https://github.com/pydata/xarray/issues/10122#issuecomment-2761798642, to have xarray check store.supports_consolidated_metadata and consolidate by default if that returns True.
zarr-python will set that as True for all of our stores, which means no behavior change for users of those stores (consolidate by default). Icechunk will set that to False.
We wanted to confirm that everyone is OK with that before moving forward with https://github.com/zarr-developers/zarr-python/pull/3119.
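A minimal sketch of how that check could look on the xarray side. The two store classes are stand-ins invented for this example (they are not zarr-python or Icechunk classes), and `should_consolidate` is a hypothetical helper. Only the attribute name `supports_consolidated_metadata` comes from the proposal; modelling it as a property, and falling back to `True` for stores that lack it, are assumptions.

```python
from typing import Optional


class LatencyBoundStore:
    """Stand-in for a store that benefits from consolidated metadata."""

    @property
    def supports_consolidated_metadata(self) -> bool:
        # A constant: answering requires no I/O.
        return True


class TransactionalStore:
    """Stand-in for a store that opts out, as Icechunk is expected to."""

    @property
    def supports_consolidated_metadata(self) -> bool:
        return False


def should_consolidate(store, consolidated: Optional[bool] = None) -> bool:
    """Hypothetical helper: resolve consolidated=None from the store itself."""
    if consolidated is not None:
        # An explicit user choice overrides the store's preference.
        return consolidated
    # Assumption: stores without the attribute keep today's default (True).
    return getattr(store, "supports_consolidated_metadata", True)
```

With this shape, users of existing stores see no change, while a store that returns `False` silently skips consolidation unless the user asks for it.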
> to have xarray check `store.supports_consolidated_metadata` and consolidate by default if that returns `True`.
An important detail: `store.supports_consolidated_metadata` is a cheap, non-I/O method. Essentially, all current stores will return a constant.
> zarr-python will set that as `True` for all of our stores, which means no behavior change for users of those stores (consolidate by default). Icechunk will set that to `False`.
Works for me!