zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

FsspecStore.from_url does not handle a swift:// URL

Open fjansson opened this issue 6 months ago • 11 comments

Zarr version

3.0.9

Numcodecs version

0.16.1

Python Version

3.12.9

Operating System

Linux

Installation

using pip in virtual environment

Description

I'm trying to write zarr data into a swift store using zarr v3. Previously, using zarr v2, something like this used to work:

url = 'swift://swift.dkrz.de/some-path/'
store = zarr.storage.FSStore(url) 
ds.to_zarr(store)

Now I've tried different ways to use the FsspecStore, but I cannot get it to work, often meeting the error "path argument to FsspecStore must not include scheme". For example:

import zarr
url='swift://swift.dkrz.de/some-path/'
store = zarr.storage.FsspecStore.from_url(url, read_only=False)
File "/etc/ecmwf/nfs/dh1_home_b/nkjf/deep-botany-zarr/convert-to-zarr-0d.py", line 120, in <module>
    store = zarr.storage.FsspecStore.from_url(url, read_only=False)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/etc/ecmwf/nfs/dh1_home_b/nkjf/deep-botany-zarr/zarr_env/lib/python3.12/site-packages/zarr/storage/_fsspec.py", line 260, in from_url
    return cls(fs=fs, path=path, read_only=read_only, allowed_exceptions=allowed_exceptions)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/etc/ecmwf/nfs/dh1_home_b/nkjf/deep-botany-zarr/zarr_env/lib/python3.12/site-packages/zarr/storage/_fsspec.py", line 149, in __init__
    raise ValueError(f"path argument to FsspecStore must not include scheme ({scheme}://)")
ValueError: path argument to FsspecStore must not include scheme (swift://)

Steps to reproduce

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues

import zarr
url='swift://swift.dkrz.de/some-path/'
store = zarr.storage.FsspecStore.from_url(url, read_only=False)

Additional output

No response

fjansson avatar Jul 03 '25 10:07 fjansson

the script you included does not include all the dependencies. Here's a modified version that does:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
#  "swiftspec",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues

import zarr
url='swift://swift.dkrz.de/some-path/'
store = zarr.storage.FsspecStore.from_url(url, read_only=False)

the problem is the behavior of the SWIFTFileSystem.strip_protocol function, which doesn't actually strip the protocol:

> /home/bennettd/.cache/uv/archive-v0/KHF2glwiL0DdF34UOji17/lib/python3.12/site-packages/zarr/storage/_fsspec.py(256)from_url()
-> return cls(fs=fs, path=path, read_only=read_only, allowed_exceptions=allowed_exceptions)
(Pdb) path
'swift://swift.dkrz.de/some-path/'
(Pdb) fs._strip_protocol(path)
'swift://swift.dkrz.de/some-path/'
(Pdb) 

d-v-b avatar Jul 03 '25 10:07 d-v-b

Thanks @d-v-b ! This sounds like an issue in swiftspec then, I will open an issue there.

However it seems the behavior was there intentional: https://github.com/fsspec/swiftspec/blob/3ae321460997ce198cdd6864387a92a44f9e22a3/swiftspec/core.py#L158

@classmethod
    def _strip_protocol(cls, path):
        """For SWIFT, we always want to keep the full URL"""
        return path

fjansson avatar Jul 03 '25 10:07 fjansson

the http filesystem does something similar. I feel like this method should actually be called _strip_protocol_or_return_the_input_unchanged, and then it would be clear to simply not use it.

d-v-b avatar Jul 03 '25 11:07 d-v-b

these are the relevant zarr code sections for your use case:

https://github.com/zarr-developers/zarr-python/blob/5a24487f09d499c91ce25f24af910c2ede055b5a/src/zarr/storage/_fsspec.py#L250-L256

https://github.com/zarr-developers/zarr-python/blob/5a24487f09d499c91ce25f24af910c2ede055b5a/src/zarr/storage/_fsspec.py#L142-L145

Evidently you can find them by searching for (¯\_(ツ)_/¯)

We already have an exemption for http, we could add one for swift, or we could do something that checks, on the basis of the behavior of _strip_protocol, whether the underlying filesystem wants the protocol in there or not.

d-v-b avatar Jul 03 '25 11:07 d-v-b

I tried adding an exception in the second instance:

if "://" in path and not path.startswith("http") and not path.startswith("swift"):

The first instance does not need an exception, since the swiftspec _strip_protocol returns its argument.

Now I can write zarr directly to swift storage using xarray's ds.to_zarr().

fjansson avatar Jul 04 '25 11:07 fjansson

that's good news. it would be nice to avoid explicit special cases here though. maybe we should fall back to generic URL parsing instead of relying on the nearly worthless fs._strip_protocol method

d-v-b avatar Jul 04 '25 11:07 d-v-b

On line 142 above, how about not demanding that the protocol is absent? If some file systems don't want the protocol, path = fs._strip_protocol(path) would remove it for those. If it's really an error to reach line 142 with the protocol still in the path, perhaps check if _strip_protocol changed the path?

This would leave it up to the file systems to strip the protocol if they want to, and you could get rid of the exception also for http. As you @d-v-b said above, this would assume fs._strip_protocol behaves as _strip_protocol_or_return_the_input_unchanged.

The http implementation does the same as the swift one, it just returns the path: https://github.com/fsspec/filesystem_spec/blob/b669a80c94e51cdd960f01f74f5946eb8166b577/fsspec/implementations/http.py#L141-143

fjansson avatar Jul 09 '25 10:07 fjansson

I probably don't see the full picture, but I'd like to get this fixed somehow. As it is now this prevents us from writing zarr v3 to SWIFT, (and downgrading zarr to 2.x to write zarr v2 data to SWIFT also doesn't work for us due to missing dimension_separator support, https://github.com/pydata/xarray/issues/10569)

The current behavior, raising an exception if the protocol is included, comes from PR #2348, addressing issue #2342 . I don't understand why it's necessary to raise the exception.

  1. If the exception is needed with some protocols, but other protocols (http, https, swift) require the protocol to be present, it seems one has to exclude those protocols before the protocol-present test and exception.

  2. If the exception itself isn't necessary, it would be possible to just do path = fs._strip_protocol(path), and let each filesystem decide to strip the protocol or not. If the exception is there just to warn users when they supply the URL in the wrong format, how about stripped_path = fs._strip_protocol(path) followed by if stripped_path != path: and print a warning? Then the warning will be printed if a protocol was supplied for a filesystem that does not want the protocol in the URL.

To me 1) seems like the smaller and safer change to implement. I will try to make a PR and see how the tests react.

fjansson avatar Jul 28 '25 10:07 fjansson

I'm trying to pick this up again in PR #3343 @dstansby and @d-v-b.

I've re-based my commits there on the current main, and added a commit to remove the test of the exception that was removed in the first commit. With this the tests pass. Do you think it could be merged like this or is something more needed?

fjansson avatar Sep 18 '25 10:09 fjansson

I'll have a look later today

d-v-b avatar Sep 18 '25 10:09 d-v-b

I just stumbled across the same issue (when trying to write to the same object store 😄)

It would be really nice to have a "clean" solution for this; my personal definition for clean would include a dirty hack, as long as it is in the zarr-python library and not a local patch on my system 😉

lkluft avatar Oct 09 '25 14:10 lkluft