Reading remote .zarr store has issue with reading shapes.parquet file
Below is the representation of the SpatialData object when it is read in locally. This is a Visium HD dataset that I originally created using spatialdata_io.visium_hd, plus some post-processing.
SpatialData object, with associated Zarr store: /<path>/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr
├── Images
│ ├── 'spatialdata_hires_image': DataArray[cyx] (3, 4352, 6000)
│ └── 'spatialdata_lowres_image': DataArray[cyx] (3, 435, 600)
├── Shapes
│ └── 'spatialdata_square_008um': GeoDataFrame shape: (127839, 1) (2D shapes)
└── Tables
├── 'square_008um': AnnData (127839, 19059)
└── 'table': AnnData (127839, 19059)
with coordinate systems:
▸ 'downscaled_hires', with elements:
spatialdata_hires_image (Images), spatialdata_square_008um (Shapes)
▸ 'downscaled_lowres', with elements:
spatialdata_lowres_image (Images), spatialdata_square_008um (Shapes)
▸ 'global', with elements:
spatialdata_square_008um (Shapes)
Reproducible example
This is a public dataset, and the datastore should be downloadable.
import spatialdata as sd
rem_path = "https://devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr"
sdata = sd.read_zarr(rem_path)
# ERROR
This reads in fine, but has other issues (which I will document in separate tickets):
sdata = sd.read_zarr(rem_path, selection=["images", "tables"])
Describe the bug
When I attempt to read a publicly accessible remote Zarr dataset, it seems that PyArrow is dropping one of the "/" characters in the https URI when it comes to the "shapes.parquet" file. I'm not sure if this is a downstream issue on that package's end, or something more upstream (including something on my end).
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.12/site-packages/geopandas/io/arrow.py", line 653, in _read_parquet_schema_and_metadata
schema = parquet.ParquetDataset(path, filesystem=filesystem, **kwargs).schema
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1348, in __init__
finfo = filesystem.get_file_info(path_or_paths)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_fs.pyx", line 590, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected a local filesystem path, got a URI: 'https:/devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr/shapes/spatialdata_square_008um/shapes.parquet'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/homebrew/lib/python3.12/site-packages/spatialdata/_core/spatialdata.py", line 1850, in read
return read_zarr(file_path, selection=selection)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/spatialdata/_io/io_zarr.py", line 121, in read_zarr
shapes[subgroup_name] = _read_shapes(f_elem_store)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/spatialdata/_io/io_shapes.py", line 54, in _read_shapes
geo_df = read_parquet(path)
^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/geopandas/io/arrow.py", line 751, in _read_parquet
schema, metadata = _read_parquet_schema_and_metadata(path, filesystem)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/geopandas/io/arrow.py", line 655, in _read_parquet_schema_and_metadata
schema = parquet.read_schema(path, filesystem=filesystem)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 2339, in read_schema
filesystem, where = _resolve_filesystem_and_path(where, filesystem)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/pyarrow/fs.py", line 179, in _resolve_filesystem_and_path
filesystem, path = FileSystem.from_uri(path)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_fs.pyx", line 477, in pyarrow._fs.FileSystem.from_uri
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: https:/devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr/shapes/spatialdata_square_008um/shapes.parquet
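The single-slash `https:/` prefix in both error messages matches what Python's pathlib does when a URL is treated as a filesystem path: repeated slashes are collapsed. The sketch below only demonstrates that pathlib behavior with the URL from this report. Whether spatialdata 0.3.0 actually builds the parquet path this way is an assumption, not something confirmed here, and the `element` variable is just an illustrative name.

```python
from pathlib import PurePosixPath
import posixpath

# Store URL and element name taken from the report above.
store_url = "https://devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr"
element = "shapes/spatialdata_square_008um"

# pathlib parses the URL as a POSIX path, so the "//" after "https:"
# is collapsed into a single "/".
as_path = str(PurePosixPath(store_url) / element / "shapes.parquet")

# Plain string joining leaves the scheme separator untouched.
as_url = posixpath.join(store_url, element, "shapes.parquet")

print(as_path)
print(as_url)
```

If this is indeed the cause, the slash would be lost before PyArrow ever sees the path, which would place the problem upstream of PyArrow.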
Expected behavior
The SpatialData object is successfully created.
Desktop (optional):
- Tested on macOS Sequoia 15.3 as well as a Dockerized Ubuntu:jammy image
Additional context
Relevant package versions are listed below. If you need me to do a deeper dive, let me know.
Python 3.12.7
spatialdata==0.3.0
spatialdata_io==0.1.6
pandas==2.2.1
anndata==0.10.6
Thanks @adkinsrs for reporting. We are working on improving remote data support in https://github.com/scverse/spatialdata/pull/842. As a disclaimer, our current focus is on better integration with OME-NGFF (specifically, support for Zarr v3 (OME-NGFF v0.5) and OME-NGFF RFC-5), so we will likely not be able to follow up on this promptly.
In any case, I have bookmarked your issues, and when we switch back to the linked PR (and follow-up PRs) we will test the code against what is described in your issue.
For the moment, any contributions in the form of PRs (building on top of #842) are welcome. Alternatively, you may consider using lower-level functions (e.g. the zarr APIs) to access the remote data and work around the reported issues.
Much appreciated @LucaMarconato. For the time being, I have a working implementation that uses local Zarr stores and this works great! I was mostly testing out potential for using remote files so that I may have that flexibility in the future. Anyways, on my end, I also consider it a lower priority. Looking forward to the Zarr v3 work.