
Support for private remote object storage

Open berombau opened this issue 1 year ago • 7 comments

Right now, there is support for local .zarr stores and for remote stores that are publicly accessible via HTTP or S3. Private remote stores are more difficult, because they need options or credentials that cannot be represented by a plain string or Path. One option is to use a zarr.storage.FSStore, which can take storage_options or any fsspec.spec.AbstractFileSystem.

Two pull requests enable this:

  • Support init of ome_zarr_py.io.ZarrLocation with zarr.storage.FSStore (#349)
  • Support remote private storage by consistent use of substore (#442)

Testing is difficult, but this is what I used:

import spatialdata as sd
import zarr

# MINIO_URL is a placeholder for the endpoint URL of the private object store.
# Works now; requires credentials in ~/.aws/credentials.
root = zarr.open(
    's3://BUCKET/spatial-sandbox/visium_associated_xenium_io.zarr',
    storage_options={'client_kwargs': {'endpoint_url': MINIO_URL}},
)
sd.read_zarr(root)
# Still works; I think this depends on .zmetadata?
sd.read_zarr('https://s3.embl.de/spatialdata/spatialdata-sandbox/visium_associated_xenium_io.zarr/')
# Still works.
sd.read_zarr('~/visium_associated_xenium_io.zarr')
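To illustrate why a bare string or Path is not enough for the private case, here is a minimal sketch using only the standard library. `StoreLocation` and `normalize_location` are hypothetical names invented for this example; they are not part of spatialdata, zarr, or fsspec. The idea is only that a remote location is really a pair of a URL and the options needed to open it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from urllib.parse import urlsplit


@dataclass(frozen=True)
class StoreLocation:
    """Hypothetical container: a store URL plus the options needed to open it."""

    url: str
    storage_options: dict[str, Any] = field(default_factory=dict)

    @property
    def protocol(self) -> str:
        # No URL scheme means a plain local path.
        return urlsplit(self.url).scheme or "file"


def normalize_location(location: str | Path | StoreLocation, **storage_options: Any) -> StoreLocation:
    """Turn a str, Path, or StoreLocation into a StoreLocation (illustrative only)."""
    if isinstance(location, StoreLocation):
        # Options passed explicitly override the ones already attached.
        merged = {**location.storage_options, **storage_options}
        return StoreLocation(location.url, merged)
    return StoreLocation(str(location), dict(storage_options))


# A private S3-compatible store needs more than its URL; the endpoint is made up.
private = normalize_location(
    "s3://BUCKET/data.zarr",
    client_kwargs={"endpoint_url": "https://example.invalid"},
)
print(private.protocol, private.storage_options)
```

A public HTTP URL or a local path normalizes to the same type with empty options, so downstream code would only have to handle one kind of object.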

berombau avatar Jan 26 '24 12:01 berombau

I refactored the code to use UPath, which solves many of the issues I had with remote support, so I would recommend UPath over Path, str, ZarrLocation...

It works with my own object storage:

from upath import UPath
from spatialdata import SpatialData

p = UPath(
    "s3://BUCKET/spatial-sandbox/visium_associated_xenium_io_tables.zarr",
    endpoint_url="https://objectstor.vib.be",
)
# full_sdata is an existing SpatialData object, not defined in this snippet.
full_sdata.write(p)
sdata = SpatialData.read(p)

I also added tests for the remote datasets and mocked remote tests. There are still some remaining issues:

  • [x] reading from private remote storage over S3 works
  • [x] writing to private remote storage over S3 works
  • [ ] test_remote_mock.py: the mock reading test that goes through ome-zarr fails, so images and labels fail. I need to test this some more, as I'm also using a patched ome_zarr.
  • [ ] test_remote.py: reading the SpatialData remote datasets over HTTP fails for the points parquet files. I also can't reproduce the previously working implementation (maybe because of a package update?).
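For readers who want to try a mocked remote without credentials, one possible approach is to serve a local directory over HTTP on localhost and read it back through a URL. The sketch below uses only the standard library; it is an assumption about how such a test could be set up, not a description of what test_remote_mock.py does. The helper name `fetch_json_from_mock_remote` and the file layout are made up for the example.

```python
import functools
import json
import tempfile
import threading
import urllib.request
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


class QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that does not print a log line per request."""

    def log_message(self, format, *args):
        pass


def fetch_json_from_mock_remote(relative_path: str, payload: dict) -> dict:
    """Write payload into a temp dir, serve it on localhost, and read it back over HTTP."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(payload))

        handler = functools.partial(QuietHandler, directory=tmp)
        # Port 0 lets the operating system pick a free port.
        server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            url = f"http://127.0.0.1:{server.server_port}/{relative_path}"
            # An empty ProxyHandler bypasses any proxy configured in the environment.
            opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
            with opener.open(url, timeout=5) as response:
                return json.loads(response.read())
        finally:
            server.shutdown()
            server.server_close()
            thread.join()


if __name__ == "__main__":
    print(fetch_json_from_mock_remote("store.zarr/.zattrs", {"hello": "world"}))
```

This only exercises plain HTTP reads of static files. It does not emulate S3 semantics such as listing or authentication, so it would complement rather than replace an S3 mock.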

It will likely be a while before I can work on this some more.

@LucaMarconato @ArneDefauw

berombau avatar Apr 12 '24 16:04 berombau