expose zarr caching from xarray
Zarr has its own internal mechanism for caching, described here:
- https://zarr.readthedocs.io/en/stable/tutorial.html#distributed-cloud-storage
- https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.LRUStoreCache
However, this capability is currently inaccessible from xarray.
I propose to add a new keyword cache=True/False to open_zarr which wraps the store in an LRUStoreCache.
Or should we use xarray's own caching mechanism?
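To make that concrete, here is a rough sketch of what the wrapping could look like (the cache and cache_max_size keyword names are hypothetical, not an existing API):

# Rough sketch only: `cache` and `cache_max_size` are hypothetical keyword
# names, not part of the current open_zarr signature.
from zarr.storage import LRUStoreCache

def open_zarr(store, cache=False, cache_max_size=2**28, **kwargs):
    if cache:
        # Wrap the underlying store so repeated chunk reads are served
        # from memory instead of hitting the (possibly remote) store.
        store = LRUStoreCache(store, max_size=cache_max_size)
    ...  # continue with the existing open_zarr logic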
I have created two PRs which attempt to provide zarr caching in different ways. I would welcome some advice on which one is a better approach.
Hi @rabernat, I looked at your PRs, and they don't seem to have gotten much attention.
I tried passing a store wrapped in an LRU cache to open_zarr, but it appears to ignore the cache.
For our use cases in https://github.com/TGSAI/mdio-python, we usually want some form of LRU cache (it doesn't necessarily have to be Zarr's).
- Do you know of a hack to make this work?
- What can we do to help and start working on this?
I have successfully used the Zarr LRU cache with Xarray. You just have to initialize the Store object outside of Xarray and then pass it to open_zarr or open_dataset(store, engine="zarr").
Have you tried that?
> You just have to initialize the Store object outside of Xarray and then pass it to open_zarr or open_dataset(store, engine="zarr").
This would be good to document!
@rabernat, yes, I have tried that like this:
from zarr.storage import FSStore, LRUStoreCache
import xarray as xr

path = "gs://prefix/object.zarr"
store_nocache = FSStore(path)
# Wrap the remote store in a 1 GiB in-memory LRU cache
store_cached = LRUStoreCache(store_nocache, max_size=2**30)
ds = xr.open_zarr(store_cached)
When I read the same data twice, it still downloads. Am I doing something wrong?
While I wait for a response, I will try it again and update if it works, but the last time I checked, it didn't.
Note to self: I also need to check it with Zarr backend and Dask backend.
@rabernat
Following up on the above: yes, it does work with the Zarr backend! I agree with @dcherian; we should add this to the docs.
However, the behavior in Dask is strange. I think it is making each worker have its own cache and blowing up memory if I ask for a large cache.
@tasansal a PR would be very welcome!
Glad you got it working! So you're saying it does not work with open_zarr and does work with open_dataset(...engine='zarr')? Weird. We should deprecate open_zarr.
> However, the behavior in Dask is strange. I think it is making each worker have its own cache and blowing up memory if I ask for a large cache.
Yes, I think I experienced that as well. I think the entire cache is serialized and passed around between workers.
I couldn't get open_zarr to open without Daskifying arrays. open_dataset(..., engine="zarr") does open without Daskifying when you haven't passed chunks.
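For reference, this is the combination that worked for me (assuming gcsfs is installed for the gs:// protocol):

import xarray as xr
from zarr.storage import FSStore, LRUStoreCache

store = LRUStoreCache(FSStore("gs://prefix/object.zarr"), max_size=2**30)
# Without `chunks`, the zarr backend returns lazily indexed (non-Dask)
# arrays, so every read in this process goes through the single LRU cache.
ds = xr.open_dataset(store, engine="zarr")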
@dcherian, I will start a PR. Where do you think this belongs in the docs? Some places I can think of:
- Examples section https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html
- https://docs.xarray.dev/en/stable/user-guide/io.html
- FAQ? https://docs.xarray.dev/en/stable/getting-started-guide/faq.html
docs.xarray.dev/en/stable/user-guide/io.html seems great to me.
I am working on a project where caching would be highly desirable. Zarr-Python 3.0 no longer includes LRUStoreCache. Any ideas on how caching to disk could be implemented now?
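One possibility, which I have not verified against Zarr-Python 3: fsspec's filecache protocol caches remote reads to local disk and is independent of Zarr's removed LRUStoreCache. A sketch, with placeholder paths:

import xarray as xr

# Unverified sketch: fsspec's "filecache::" chaining keeps an on-disk copy
# of each remote object it reads; cache_storage is a local directory.
ds = xr.open_dataset(
    "filecache::gs://prefix/object.zarr",
    engine="zarr",
    storage_options={"filecache": {"cache_storage": "/tmp/zarr-cache"}},
)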