expose zarr caching from xarray
Zarr has its own internal mechanism for caching, described here:
- https://zarr.readthedocs.io/en/stable/tutorial.html#distributed-cloud-storage
- https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.LRUStoreCache
However, this capability is currently inaccessible from xarray.
I propose to add a new keyword cache=True/False to open_zarr which wraps the store in an LRUStoreCache.
Or should we use xarray's own caching mechanism?
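To make that concrete, here is a rough sketch of what the wrapping could look like (the cache and cache_max_size keyword names are hypothetical, not an existing API):

# Rough sketch only: `cache` and `cache_max_size` are hypothetical keyword
# names, not part of the current open_zarr signature.
from zarr.storage import LRUStoreCache

def open_zarr(store, cache=False, cache_max_size=2**28, **kwargs):
    if cache:
        # Wrap the underlying store so repeated chunk reads are served
        # from memory instead of hitting the (possibly remote) store.
        store = LRUStoreCache(store, max_size=cache_max_size)
    ...  # continue with the existing open_zarr logic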
I have created two PRs which attempt to provide zarr caching in different ways. I would welcome some advice on which one is a better approach.
Hi @rabernat, I looked at your PRs, and they don't seem to have gotten much attention.
I tried passing a store wrapped in an LRU cache to open_zarr, but it appears to ignore the cache.
For our use cases in https://github.com/TGSAI/mdio-python, we usually want some form of LRU cache (it doesn't necessarily have to be Zarr's).
- Do you know of a hack to make this work?
- What can we do to help and start working on this?
I have successfully used the Zarr LRU cache with Xarray. You just have to initialize the Store object outside of Xarray and then pass it to open_zarr or open_dataset(store, engine="zarr").
Have you tried that?
> You just have to initialize the Store object outside of Xarray and then pass it to open_zarr or open_dataset(store, engine="zarr").
This would be good to document!
@rabernat, yes, I have tried that like this:
from zarr.storage import FSStore, LRUStoreCache
import xarray as xr

path = "gs://prefix/object.zarr"
store_nocache = FSStore(path)
# Wrap the remote store in a 1 GiB in-memory LRU cache
store_cached = LRUStoreCache(store_nocache, max_size=2**30)
ds = xr.open_zarr(store_cached)
When I read the same data twice, it still downloads. Am I doing something wrong?
While I wait for a response, I will try it again and update if it works, but the last time I checked, it didn't.
Note to self: I also need to check it with Zarr backend and Dask backend.
@rabernat
Following up on the above: yes, it does work with the Zarr backend! I agree with @dcherian; we should add this to the docs.
However, the behavior in Dask is strange. I think it is making each worker have its own cache and blowing up memory if I ask for a large cache.
@tasansal a PR would be very welcome!
Glad you got it working! So you're saying it does not work with open_zarr and does work with open_dataset(...engine='zarr')? Weird. We should deprecate open_zarr.
> However, the behavior in Dask is strange. I think it is making each worker have its own cache and blowing up memory if I ask for a large cache.
Yes, I think I experienced that as well. I think the entire cache is serialized and passed around between workers.
I couldn't get open_zarr to open without Daskifying arrays. open_dataset(..., engine="zarr") does open without Daskifying when you haven't passed chunks.
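For reference, this is the combination that worked for me (assuming gcsfs is installed for the gs:// protocol):

import xarray as xr
from zarr.storage import FSStore, LRUStoreCache

store = LRUStoreCache(FSStore("gs://prefix/object.zarr"), max_size=2**30)
# Without `chunks`, the zarr backend returns lazily indexed (non-Dask)
# arrays, so every read in this process goes through the single LRU cache.
ds = xr.open_dataset(store, engine="zarr")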
@dcherian, I will start a PR. Where do you think this belongs in the docs? Some places I can think of:
- Examples section https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html
- https://docs.xarray.dev/en/stable/user-guide/io.html
- FAQ? https://docs.xarray.dev/en/stable/getting-started-guide/faq.html
docs.xarray.dev/en/stable/user-guide/io.html seems great to me.
I am working on a project where caching would be highly desirable. Zarr-Python 3.0 no longer includes LRUStoreCache. Any ideas on how caching to disk could be implemented now?
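One possibility, which I have not verified against Zarr-Python 3: fsspec's filecache protocol caches remote reads to local disk and is independent of Zarr's removed LRUStoreCache. A sketch, with placeholder paths:

import xarray as xr

# Unverified sketch: fsspec's "filecache::" chaining keeps an on-disk copy
# of each remote object it reads; cache_storage is a local directory.
ds = xr.open_dataset(
    "filecache::gs://prefix/object.zarr",
    engine="zarr",
    storage_options={"filecache": {"cache_storage": "/tmp/zarr-cache"}},
)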