FsspecStore.from_url does not handle a swift:// URL
Zarr version
3.0.9
Numcodecs version
0.16.1
Python Version
3.12.9
Operating System
Linux
Installation
using pip in virtual environment
Description
I'm trying to write zarr data into a swift store using zarr v3. Previously, using zarr v2, something like this used to work:
url = 'swift://swift.dkrz.de/some-path/'
store = zarr.storage.FSStore(url)
ds.to_zarr(store)
Now I've tried different ways to use the FsspecStore, but I cannot get it to work, often meeting the error "path argument to FsspecStore must not include scheme". For example:
import zarr
url='swift://swift.dkrz.de/some-path/'
store = zarr.storage.FsspecStore.from_url(url, read_only=False)
File "/etc/ecmwf/nfs/dh1_home_b/nkjf/deep-botany-zarr/convert-to-zarr-0d.py", line 120, in <module>
store = zarr.storage.FsspecStore.from_url(url, read_only=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/etc/ecmwf/nfs/dh1_home_b/nkjf/deep-botany-zarr/zarr_env/lib/python3.12/site-packages/zarr/storage/_fsspec.py", line 260, in from_url
return cls(fs=fs, path=path, read_only=read_only, allowed_exceptions=allowed_exceptions)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/etc/ecmwf/nfs/dh1_home_b/nkjf/deep-botany-zarr/zarr_env/lib/python3.12/site-packages/zarr/storage/_fsspec.py", line 149, in __init__
raise ValueError(f"path argument to FsspecStore must not include scheme ({scheme}://)")
ValueError: path argument to FsspecStore must not include scheme (swift://)
Steps to reproduce
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
import zarr
url='swift://swift.dkrz.de/some-path/'
store = zarr.storage.FsspecStore.from_url(url, read_only=False)
Additional output
No response
the script you included does not include all the dependencies. Here's a modified version that does:
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# "swiftspec",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
import zarr
url='swift://swift.dkrz.de/some-path/'
store = zarr.storage.FsspecStore.from_url(url, read_only=False)
the problem is the behavior of the SWIFTFileSystem.strip_protocol function, which doesn't actually strip the protocol:
> /home/bennettd/.cache/uv/archive-v0/KHF2glwiL0DdF34UOji17/lib/python3.12/site-packages/zarr/storage/_fsspec.py(256)from_url()
-> return cls(fs=fs, path=path, read_only=read_only, allowed_exceptions=allowed_exceptions)
(Pdb) path
'swift://swift.dkrz.de/some-path/'
(Pdb) fs._strip_protocol(path)
'swift://swift.dkrz.de/some-path/'
(Pdb)
Thanks @d-v-b ! This sounds like an issue in swiftspec then, I will open an issue there.
However it seems the behavior was there intentional: https://github.com/fsspec/swiftspec/blob/3ae321460997ce198cdd6864387a92a44f9e22a3/swiftspec/core.py#L158
@classmethod
def _strip_protocol(cls, path):
"""For SWIFT, we always want to keep the full URL"""
return path
the http filesystem does something similar. I feel like this method should actually be called _strip_protocol_or_return_the_input_unchanged, and then it would be clear to simply not use it.
these are the relevant zarr code sections for your use case:
https://github.com/zarr-developers/zarr-python/blob/5a24487f09d499c91ce25f24af910c2ede055b5a/src/zarr/storage/_fsspec.py#L250-L256
https://github.com/zarr-developers/zarr-python/blob/5a24487f09d499c91ce25f24af910c2ede055b5a/src/zarr/storage/_fsspec.py#L142-L145
Evidently you can find them by searching for (¯\_(ツ)_/¯)
We already have an exemption for http, we could add one for swift, or we could do something that checks, on the basis of the behavior of _strip_protocol, whether the underlying filesystem wants the protocol in there or not.
I tried adding an exception in the second instance:
if "://" in path and not path.startswith("http") and not path.startswith("swift"):
The first instance does not need an exception, since the swiftspec _strip_protocol returns its argument.
Now I can write zarr directly to swift storage using xarray's ds.to_zarr().
that's good news. it would be nice to avoid explicit special cases here though. maybe we should fall back to generic URL parsing instead of relying on the nearly worthless fs._strip_protocol method
On line 142 above, how about not demanding that the protocol is absent? If some file systems don't want the protocol, path = fs._strip_protocol(path) would remove it for those. If it's really an error to reach line 142 with the protocol still in the path, perhaps check if _strip_protocol changed the path?
This would leave it up to the file systems to strip the protocol if they want to, and you could get rid of the exception also for http. As you @d-v-b said above, this would assume fs._strip_protocol behaves as _strip_protocol_or_return_the_input_unchanged.
The http implementation does the same as the swift one, it just returns the path: https://github.com/fsspec/filesystem_spec/blob/b669a80c94e51cdd960f01f74f5946eb8166b577/fsspec/implementations/http.py#L141-143
I probably don't see the full picture, but I'd like to get this fixed somehow. As it is now this prevents us from writing zarr v3 to SWIFT, (and downgrading zarr to 2.x to write zarr v2 data to SWIFT also doesn't work for us due to missing dimension_separator support, https://github.com/pydata/xarray/issues/10569)
The current behavior, raising an exception if the protocol is included, comes from PR #2348, addressing issue #2342 . I don't understand why it's necessary to raise the exception.
-
If the exception is needed with some protocols, but other protocols (http, https, swift) require the protocol to be present, it seems one has to exclude those protocols before the protocol-present test and exception.
-
If the exception itself isn't necessary, it would be possible to just do
path = fs._strip_protocol(path), and let each filesystem decide to strip the protocol or not. If the exception is there just to warn users when they supply the URL in the wrong format, how aboutstripped_path = fs._strip_protocol(path)followed byif stripped_path != path:and print a warning? Then the warning will be printed if a protocol was supplied for a filesystem that does not want the protocol in the URL.
To me 1) seems like the smaller and safer change to implement. I will try to make a PR and see how the tests react.
I'm trying to pick this up again in PR #3343 @dstansby and @d-v-b.
I've re-based my commits there on the current main, and added a commit to remove the test of the exception that was removed in the first commit. With this the tests pass. Do you think it could be merged like this or is something more needed?
I'll have a look later today
I just stumbled across the same issue (when trying to write to the same object store 😄)
It would be really nice to have a "clean" solution for this; my personal definition for clean would include a dirty hack, as long as it is in the zarr-python library and not a local patch on my system 😉