
Reading remote .zarr store has issue with reading shapes.parquet file

Open adkinsrs opened this issue 10 months ago • 3 comments

Representation of the SpatialData object when read from a local store. This is a Visium HD dataset that I originally created using spatialdata_io.visium_hd plus some post-processing.

SpatialData object, with associated Zarr store: /<path>/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr
├── Images
│     ├── 'spatialdata_hires_image': DataArray[cyx] (3, 4352, 6000)
│     └── 'spatialdata_lowres_image': DataArray[cyx] (3, 435, 600)
├── Shapes
│     └── 'spatialdata_square_008um': GeoDataFrame shape: (127839, 1) (2D shapes)
└── Tables
      ├── 'square_008um': AnnData (127839, 19059)
      └── 'table': AnnData (127839, 19059)
with coordinate systems:
    ▸ 'downscaled_hires', with elements:
        spatialdata_hires_image (Images), spatialdata_square_008um (Shapes)
    ▸ 'downscaled_lowres', with elements:
        spatialdata_lowres_image (Images), spatialdata_square_008um (Shapes)
    ▸ 'global', with elements:
        spatialdata_square_008um (Shapes)


Reproducible example

This is a public dataset and the datastore should be downloadable

import spatialdata as sd
rem_path = "https://devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr"
sdata = sd.read_zarr(rem_path)
# ERROR

This reads in fine, but has other issues (which I will document in separate tickets):

sdata = sd.read_zarr(rem_path, selection=["images", "tables"])

Describe the bug

When I attempt to read in a publicly accessible remote Zarr dataset, it seems that PyArrow is dropping one of the "/" characters in the https URI when it comes to the "shapes.parquet" file. I'm not sure if this is a downstream issue on that package's end, or more upstream (including something on my end).

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.12/site-packages/geopandas/io/arrow.py", line 653, in _read_parquet_schema_and_metadata
    schema = parquet.ParquetDataset(path, filesystem=filesystem, **kwargs).schema
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1348, in __init__
    finfo = filesystem.get_file_info(path_or_paths)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 590, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected a local filesystem path, got a URI: 'https:/devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr/shapes/spatialdata_square_008um/shapes.parquet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.12/site-packages/spatialdata/_core/spatialdata.py", line 1850, in read
    return read_zarr(file_path, selection=selection)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/spatialdata/_io/io_zarr.py", line 121, in read_zarr
    shapes[subgroup_name] = _read_shapes(f_elem_store)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/spatialdata/_io/io_shapes.py", line 54, in _read_shapes
    geo_df = read_parquet(path)
             ^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/geopandas/io/arrow.py", line 751, in _read_parquet
    schema, metadata = _read_parquet_schema_and_metadata(path, filesystem)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/geopandas/io/arrow.py", line 655, in _read_parquet_schema_and_metadata
    schema = parquet.read_schema(path, filesystem=filesystem)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 2339, in read_schema
    filesystem, where = _resolve_filesystem_and_path(where, filesystem)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pyarrow/fs.py", line 179, in _resolve_filesystem_and_path
    filesystem, path = FileSystem.from_uri(path)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 477, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: https:/devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr/shapes/spatialdata_square_008um/shapes.parquet
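
The single slash after `https:` in both error messages matches what happens when a URL is handled as a filesystem path object. The minimal sketch below reproduces that pattern with the standard library only. Whether this is the actual mechanism inside spatialdata 0.3.0 is an assumption that the report does not confirm, and the URL and element name are placeholders.

```python
from pathlib import PurePosixPath

# Placeholder URL for illustration, not the store from the report.
store_url = "https://example.org/datasets/store.zarr"

# Treating the URL as a path collapses the "//" that follows the scheme.
as_path = PurePosixPath(store_url) / "shapes" / "element" / "shapes.parquet"
collapsed = str(as_path)

# Joining the segments as plain strings leaves the scheme separator intact.
joined = "/".join([store_url.rstrip("/"), "shapes", "element", "shapes.parquet"])

print(collapsed)
print(joined)
```

If a path object is involved somewhere before the string reaches `read_parquet`, PyArrow would receive a string that is neither a local path nor a valid URI, which is consistent with the two `ArrowInvalid` messages.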

Expected behavior

The SpatialData object is successfully created.

Desktop (optional):

  • Tested on macOS Sequoia 15.3 as well as in a Dockerized ubuntu:jammy image

Additional context

Relevant package versions. If you need me to go into a deeper dive, let me know.

Python 3.12.7

spatialdata==0.3.0
spatialdata_io==0.1.6
pandas==2.2.1
anndata==0.10.6

adkinsrs, Feb 14 '25 15:02

Thanks @adkinsrs for reporting. We are working on improving remote data support here: https://github.com/scverse/spatialdata/pull/842. A disclaimer: our current focus is working towards better integration with OME-NGFF (specifically, support for Zarr v3 (OME-NGFF v0.5) and OME-NGFF RFC-5), so we will likely not be able to follow up on this promptly.

Anyway, I bookmarked your issues, and when we switch back to the linked PR (and follow-up PRs) we will test the code against what is described in your issue.

LucaMarconato, Feb 17 '25 19:02

For the moment, any contributions in the form of PRs (building on top of #842) are welcome. On the other hand, you may consider using lower-level functions (e.g. zarr APIs) to access the remote data to circumvent the issues reported.

LucaMarconato, Feb 17 '25 19:02
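
One way to read the lower-level suggestion for the shapes element is to bypass `read_zarr` for that file, download `shapes.parquet` with fsspec, and read the local copy with geopandas. This is only a sketch under stated assumptions: it uses fsspec and geopandas rather than the zarr APIs mentioned above, the helper names `element_url` and `read_remote_shapes` are hypothetical, the `shapes/<element>/shapes.parquet` layout is taken from the traceback, and HTTP access through fsspec needs aiohttp installed.

```python
import tempfile
from pathlib import Path


def element_url(store_url: str, *parts: str) -> str:
    """Join URL segments as plain strings so the 'https://' prefix survives."""
    return "/".join([store_url.rstrip("/"), *(p.strip("/") for p in parts)])


def read_remote_shapes(store_url: str, element: str):
    """Download one element's shapes.parquet and read it from a local copy.

    Hypothetical helper, not part of spatialdata.
    """
    # Imported lazily so element_url works without these packages installed.
    import fsspec
    import geopandas as gpd

    url = element_url(store_url, "shapes", element, "shapes.parquet")
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "shapes.parquet"
        # Copy the remote bytes to disk so the parquet reader sees a local path.
        with fsspec.open(url, "rb") as src:
            local.write_bytes(src.read())
        # The GeoDataFrame is fully loaded before the temporary directory is removed.
        return gpd.read_parquet(local)
```

The result is a plain GeoDataFrame. Any coordinate transformations that spatialdata stores alongside the element are not applied here, so it would still need to be parsed before being combined with the images and tables loaded via `selection=["images", "tables"]`.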

Much appreciated @LucaMarconato. For the time being, I have a working implementation that uses local Zarr stores and this works great! I was mostly testing out potential for using remote files so that I may have that flexibility in the future. Anyways, on my end, I also consider it a lower priority. Looking forward to the Zarr v3 work.

adkinsrs, Feb 17 '25 19:02