Zarr Python 3 tracking issue
Zarr-Python 3.0 is getting close to a full release. This issue tracks the integration of the 3.0 release with Kerchunk.
Here's a running list of issues we expect we need to solve either here or upstream:
- [ ] Develop an abstraction over Zarr Python's store interface (Zarr-Python 3.0's store is no longer a mutable mapping)
- [ ] Develop a reference-filesystem like store for Zarr
Eventually, we may also want to:
- [ ] construct Zarr's metadata classes directly rather than using Zarr's top-level API w/ memory stores
- [ ] support writing v3 metadata (https://github.com/fsspec/kerchunk/issues/235 is the issue for that)
xref: #504
I have been working on the zarr python support in kerchunk in #516, but i need to cycle off of it for a bit. So I am going to list out what is left to be done here and how it can be split off as best I can in hopes that someone else can help me to move this along in the meantime:
Heres where the functionality is at in that PR.
Generating references:
- [x] HDF
- [x] netcdf3
- [x] grib2
- [ ] zarr (not tested, might work)
- [ ] tiff (not tested, might work)
- [ ] others (not sure)
Reading with xarray
Currently creating a store works, however there are caveats:
- [x] fsspec
ReferenceFilesystemasynccat_fileis broken https://github.com/fsspec/filesystem_spec/pull/1734 - [x] zarr 3
RemoteStoreissues:zarr python3 requires that an fsspec filesystem used for aRemoteStoreis anAsyncFilesystem.ReferenceFilesystemsupports this, however this also means that the remote filesystem within theReferenceFilesystemmust be async. This means that all usage where data files are on a non async filesystem will not work. This blocks most of the tests in kerchunk as it stands today because LocalFilesystem is not async
Codecs
Codecs (filters, compressors) are treated differently in zarr 3. zarr 2 stores read with zarr 3 will still use numcodecs codecs, however if users want to ever use these codecs with another store (say to read a grib virtual dataset with Icechunk) the codecs from this package need to conform to zarr's Codec abc
- [ ] Codecs support zarr 3
CodecAPI
I think that this is mostly it, and im happy to help anyone who is interested in helping drive this forward
remote filesystem within the ReferenceFilesystem must be async
where data files are on a non async filesystem will not work
In a side channel, I commented that such a wrapper could be written, but would be a pretty ugly piece of code.
Sorry for jumping into the discussion, I've been following along for a bit.
Would the asynclocal filesystem in morefs be a good workaround for enabling tests?
https://github.com/iterative/morefs/blob/ea30fa8d3fa81a2b739b20810a81aede490e4571/src/morefs/asyn_local.py#L30
Sorry, @ap-- , only just noticed what you wrote. Your async wrapper for local is useful and should be more comprehensive than what I have done (below); but we might want to be able to wrap other sync filesystem too.
https://github.com/fsspec/kerchunk/pull/524 is a sketch of how you might wrap a synchronous filesystem to make it work with async calls.
In [1]: import fsspec
...: import kerchunk.sync_wrapper
...: fs = fsspec.filesystem("file")
...: sfs = kerchunk.sync_wrapper.SyncronousWrapper(fs)
...: await sfs._ls(".")
...:
Out[1]:
[files, ...]
I'm taking a crack at some of the outstanding tasks here though I'm not certain I have all the context that might be useful, so please disabuse me of any confusions that are obvious. As of now, cutting changes against @mpiannucci's v3 branch. Here's support for an abstraction over Zarr Python's store interface
So far, I've tested things on a locally generated zarr store produced from a couple of concatenated netcdfs. Any wisdom regarding your local testing and debugging workflows might prove useful going forward
EDIT: rereading things and thinking about the intent here, I'm wondering if my approach in the above PR (which is just to incorporate special logic for handling Store instances where relevant in kerchunk) is off. Might the 'abstraction over Zarr python's store interface' better be addressed by simply implementing a MutableMapping instance that wraps Zarr-Python's Store within zarr python itself?
With this PR merged https://github.com/fsspec/filesystem_spec/pull/1745 we are one step closer to getting the tests updated to work against zarr 3. Great work @moradology
I dont think that https://github.com/zarr-developers/zarr-python/issues/2468 is necessary to get the tests working here, as you can always create the filesystem and pass it to the RemoteStore manually without using a url
Is it just the codecs support left to do here now? I'm wondering if the references generation can be punted on by e.g. warning / raising if you try to call ZarrToZarr with zarr-python v3 installed...
@moradology is working on the tests, but yes AFAIK the codecs are the main thing left other than zarr to zarr and maybe multizarrtozarr
I posted an update here: https://github.com/fsspec/kerchunk/pull/516#issuecomment-2501887793 . Thanks @moradology for working on the async wrapper.
Most of the issues left are just testing things where zarr was being used in a way it just doesnt really support anymore, OR because we have no current way to wrap memory refs which a lot of the tests do.
zarr python 3 is a pretty big departure from version 2 so there is a lot of change here