anndata icon indicating copy to clipboard operation
anndata copied to clipboard

Dask and Zarr not loading obsp and obsm from remote s3

Open will-moore opened this issue 1 year ago • 15 comments

Hi,

I'm using @ilan-gold's nice sample code at https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread,write%7D_dispatched.html to read remote anndata, which is working great when I'm serving data locally via http.

But when I'm using minio to serve the data, I'm not getting the obsp, obsm or uns parts of the AnnData object. See code sample below.

A UI view of the data is at https://deploy-preview-20--ome-ngff-validator.netlify.app/?source=https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table/ (from https://github.com/ome/ome-ngff-validator/pull/20) where I'm using the extra "keys" in e.g. https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table/obsm/.zattrs to load those groups:

{
    "encoding-type": "dict",
    "encoding-version": "0.1.0",
    "keys": [
        "X_scanorama",
        "X_umap",
        "spatial"
    ]
}

Is there any way I can use those 'keys' to load obsp and obsm data?

Thanks!

from typing import Callable, Union
import dask.array as da
import zarr
from anndata import AnnData
from anndata._io.specs import IOSpec
from anndata.compat import H5Array, H5Group, ZarrArray, ZarrGroup

# ** requires anndata==0.9.0.rc1
from anndata.experimental import read_dispatched, read_elem
from zarr.storage import FSStore
from ome_zarr.io import parse_url
StorageType = Union[H5Array, H5Group, ZarrArray, ZarrGroup]


def read_remote_anndata(store: FSStore, name: str) -> AnnData:
    table_group = zarr.group(store=store, path=name)

    def callback(
        func: Callable, elem_name: str, elem: StorageType, iospec: IOSpec
    ) -> AnnData:
        if iospec.encoding_type in (
            "dataframe",
            "csr_matrix",
            "csc_matrix",
            "awkward-array",
        ):
            # Preventing recursing inside of these types
            return read_elem(elem)
        elif iospec.encoding_type == "array":
            return da.from_zarr(elem)
        else:
            return func(elem)

    adata = read_dispatched(table_group, callback=callback)
    return adata

url = "https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/"
name = "regions_table"
store = parse_url(url, mode="r").store
anndata_obj = read_remote_anndata(store, name)

print('anndata_obj', anndata_obj)

# anndata_obj AnnData object with n_obs × n_vars = 1045 × 36
#     obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
#     var: 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'


url = "http://localhost:8000/test_segment.zarr/tables/"
store = parse_url(url, mode="r").store
anndata_obj = read_remote_anndata(store, name)

print('anndata_obj', anndata_obj)

# anndata_obj AnnData object with n_obs × n_vars = 1045 × 36
#     obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
#     var: 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'
#     uns: 'Cluster_colors', 'batch_colors', 'neighbors', 'spatial', 'umap'
#     obsm: 'X_scanorama', 'X_umap', 'spatial'
#     obsp: 'connectivities', 'distances'

will-moore avatar Mar 16 '23 15:03 will-moore

@will-moore Is the store consolidated? I get a 200 on https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table/.zgroup but not on https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table/.zmetadata

ilan-gold avatar Mar 16 '23 15:03 ilan-gold

If not, I should probably make that more obvious as an essential step in the tutorial. @ivirshup That is actually something I've been meaning to mention - we should be able to resolve as @will-moore does the keys of obsm etc. without consolidating metadata at a slight penalty. Should we implement this?

ilan-gold avatar Mar 16 '23 15:03 ilan-gold

Ah, right - that's coming back to the discussion at https://github.com/kevinyamauchi/ome-ngff-tables-prototype/pull/12 where we preferred consolidated metadata not to be part of the NGFF spec. cc @joshmoore

will-moore avatar Mar 16 '23 15:03 will-moore

I mean, y'all are big users and the feature probably makes sense anyway so we probably should implement it. But the penalty is of course a bit higher since you need to open each .zarray for each column.

ilan-gold avatar Mar 16 '23 15:03 ilan-gold

By "implement it" you mean support for "keys"? Would this need to happen in anndata itself, or is there a way to add it in the example above?

This is code I'm adding to https://github.com/ome/ome-zarr-py/pull/256 so all NGFF python users will be able to use it.

will-moore avatar Mar 16 '23 15:03 will-moore

I am not sure @will-moore but "implement it" means using the trick you use in the browser to discover columns, whether that's in this example or in general. At the moment, we iterate over the keys of the incoming obsm object (a dict) to read it and consolidated metadata lets you have those keys.

ilan-gold avatar Mar 16 '23 16:03 ilan-gold

I seem to have got the loading of obsm working, using the custom keys (see below), but obsp is beyond what I can get my head around just now...

I don't know if I'm getting close, or if this is really a lot more complex and we should resign ourselves to using consolidate_metadata if we want this remote access to work?

from typing import Callable, Union
import json

import dask.array as da
import zarr
from anndata import AnnData
from anndata._io.specs import IOSpec
from anndata.compat import H5Array, H5Group, ZarrArray, ZarrGroup

# ** requires anndata==0.9.0.rc1
from anndata.experimental import read_dispatched, read_elem
from zarr.storage import FSStore

from ome_zarr.io import parse_url

StorageType = Union[H5Array, H5Group, ZarrArray, ZarrGroup]

# From https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread,write%7D_dispatched.html


def read_remote_anndata(store: FSStore, name: str) -> AnnData:
    table_group = zarr.group(store=store, path=name)

    def callback(
        func: Callable, elem_name: str, elem: StorageType, iospec: IOSpec
    ) -> AnnData:
        print("el", elem_name, iospec.encoding_type, elem)
        if iospec.encoding_type in (
            "dataframe",
            "csr_matrix",
            "csc_matrix",
            "awkward-array",
        ):
            # Preventing recursing inside of these types
            return read_elem(elem)
        elif iospec.encoding_type == "array":
            return da.from_zarr(elem)
        elif iospec.encoding_type == "dict" and "obsm" in elem_name:    # or "obsp" in elem_name:
            # load .zattrs
            attrs = store.get(elem_name + "/.zattrs")
            attrs = json.loads(attrs)
            print("attrs", attrs)
            if "keys" in attrs:
                to_return = {}
                for key in attrs["keys"]:
                    print("URL", elem_name + "/" + key)
                    try:
                        arr = da.from_zarr(store, elem_name + "/" + key)
                        to_return[key] = arr
                    except:
                        print("Failed " + elem_name + "/" + key)
                        pass
                return to_return
        # not handled above, call func()
        rsp = func(elem)
        return rsp

    adata = read_dispatched(table_group, callback=callback)
    return adata

url = "https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/"
name = "regions_table"
store = parse_url(url, mode="r").store
anndata_obj = read_remote_anndata(store, name)

print('anndata_obj', anndata_obj)

will-moore avatar Mar 21 '23 17:03 will-moore

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

github-actions[bot] avatar Jun 09 '23 02:06 github-actions[bot]

@ivirshup what do you think about this?

@ivirshup That is actually something I've been meaning to mention - we should be able to resolve as @will-moore does the keys of obsm etc. without consolidating metadata at a slight penalty. Should we implement this?

flying-sheep avatar Jun 13 '23 09:06 flying-sheep

@flying-sheep I am doing this in the upcoming PR already #947

ilan-gold avatar Jun 13 '23 10:06 ilan-gold

Thanks all for following up on this 👍

will-moore avatar Jun 13 '23 11:06 will-moore

Should we mark that PR as Fixes #951 then?

flying-sheep avatar Jun 13 '23 17:06 flying-sheep

Ah, good call. I'll add it.

ilan-gold avatar Jun 14 '23 11:06 ilan-gold

Hi @will-moore as an alternative to https://github.com/scverse/anndata/pull/947 you can also try out https://github.com/scverse/anndata/tree/ig/xarray_compat (or https://github.com/scverse/anndata/pull/1247) which is a bit lighter weight, but similar with a read_backed method. This returns a normal AnnData object but with xarray.

ilan-gold avatar Nov 30 '23 12:11 ilan-gold

Thanks for the update @ilan-gold - Unfortunately I'm not looking to include this work in ome-zarr-py now since the spec proposal isn't going ahead just now https://github.com/ome/ngff/pull/64

will-moore avatar Nov 30 '23 14:11 will-moore