kerchunk icon indicating copy to clipboard operation
kerchunk copied to clipboard

Add Zarr compatibility functions

Open ghidalgo3 opened this issue 1 year ago • 9 comments

This is a continuation of https://github.com/fsspec/kerchunk/pull/475 because I identified that there are several more behavior changes in zarr v3. Going forward, the utils.py file will abstract away the differences between ZarrV2 and ZarrV3 depending on what the user has installed.

Note that @jhamman's PR https://github.com/fsspec/kerchunk/pull/292 may now be outdated, though I'll merge those commits into this PR if its desired to expose a zarr_version to kerchunk callers.

ghidalgo3 avatar Jul 15 '24 22:07 ghidalgo3

Thank you for putting this together, you have clearly done a careful job (and thanks to @jhamman for the earlier attempt). This will take some time to go through.

I would encourage you to also look at the implementation in fsspec.implementations.reference, where I expect there will be more zarr2-specific assumptions, particularly in the parquet implementation.

martindurant avatar Jul 16 '24 14:07 martindurant

Following @TomAugspurger 's comments, it may well make sense to have None mean V2 for now, whichever zarr version is installed

martindurant avatar Jul 16 '24 14:07 martindurant

Thank you for putting this together, you have clearly done a careful job (and thanks to @jhamman for the earlier attempt). This will take some time to go through.

I would encourage you to also look at the implementation in fsspec.implementations.reference, where I expect there will be more zarr2-specific assumptions, particularly in the parquet implementation.

In fsspec I found that fsspec.mapping.FSMap or LazyReferenceMapper objects are passed to zarr.open and zarr.group calls. In ZarrV2, that worked because both of those types are MutableMappings and ZarrV2 knew how to work with MutableMapping stores.

However in ZarrV3, you need to pass a StoreLike object to any of the functions that previously expected MutableMappings. That means fsspec isn't compatible with zarrv3 either :) I'm not sure if these compatibility functions should also be defined at the fsspec layer or if zarr should provide a way to turn MutableMappings into one of its Store objects and preserve API compatibility. I think the latter is preferrable because that allows more users (not just fsspec and kerchunk) to easily adopt zarrv3.

ghidalgo3 avatar Jul 16 '24 18:07 ghidalgo3

I found that fsspec.mapping.FSMap or LazyReferenceMapper objects are passed to zarr.open and zarr.group calls.

No, this is not the model.

  • LazyReferenceMapper provides references to referenceFS. It is also map-like, but further work must be done to turn references to bytes
  • FSMap can be passed directly to zarr2, but it wraps it in a Store too, made for generic dict-like input
  • there is also an FSspecStore, which is what you get if you pass a URL to zarr.open
  • v3 has its own version of fsspecstore https://github.com/zarr-developers/zarr-python/pull/1785

martindurant avatar Jul 16 '24 19:07 martindurant

Sorry I didn't grok that. Here I see fsspec tests creating LazyReferenceMappers and passing them to zarr.open. In ZarrV3, all the object creation functions run the StoreLike through this function. I don't see how a RemoteStore would be created unless a user explicitly passed it to the zarr.open call like this:

store = RemoteStore(url="abfs://...")
zarr.open(store=store)

there is also an FSspecStore, which is what you get if you pass a URL to zarr.open

Unless I'm missing it, passing a URL to zarr.open in zarrV3 will return a local store after it's run through make_store_path:

def make_store_path(store_like: StoreLike | None, *, mode: OpenMode | None = None) -> StorePath:
    if isinstance(store_like, StorePath):
        if mode is not None:
            assert mode == store_like.store.mode
        return store_like
    elif isinstance(store_like, Store):
        if mode is not None:
            assert mode == store_like.mode
        return StorePath(store_like)
    elif store_like is None:
        if mode is None:
            mode = "w"  # exception to the default mode = 'r'
        return StorePath(MemoryStore(mode=mode))
    elif isinstance(store_like, str):
        return StorePath(LocalStore(Path(store_like), mode=mode or "r"))
    raise TypeError

Did this work in ZarrV2?

zarr.open(store="abfs://...")?

ghidalgo3 avatar Jul 16 '24 20:07 ghidalgo3

passing a URL to zarr.open in zarrV3 will return a local store after it's run through make_store_path

zarr.open(store="abfs://...")?

Yes, this most certainly worked, and if it no longer does, I consider this a major regression and worse user experience. The optional argument storage_options gets passed to fsspec.

I see fsspec tests creating LazyReferenceMappers and passing them to zarr.open

I suppose you are right and this works too - the lazy mapper is also dict-like. It's probably a case of my lazyness in the tests; it only works where the "files" are stored as raw binary, not as references (which is the case for the tests).

martindurant avatar Jul 17 '24 13:07 martindurant

cc @jhamman

martindurant avatar Jul 19 '24 13:07 martindurant

Yes, this most certainly worked, and if it no longer does, I consider this a major regression and worse user experience. The optional argument storage_options gets passed to fsspec.

It doesn't work in ZarrV3 yet:

# Missing storage_options but it's irrelevant
# this eventually tries to construct a StorePath(LocalStore, 'file://abfs:/daymet-zarr/daily/hi.zarr')
# and will throw because it's not writable
zarr.open(store="abfs://daymet-zarr/daily/hi.zarr")

I see fsspec tests creating LazyReferenceMappers and passing them to zarr.open

I suppose you are right and this works too - the lazy mapper is also dict-like. It's probably a case of my lazyness in the tests; it only works where the "files" are stored as raw binary, not as references (which is the case for the tests).

Laziness can be good, and your tests could probably use the MemoryStore at some point, but ZarrV3 shouldn't require you to change your call sites IMO.

ghidalgo3 avatar Jul 19 '24 20:07 ghidalgo3

Here probably - explicitly expects .zarray in every path containing data.

martindurant avatar Jul 22 '24 14:07 martindurant