spatialdata-io
spatialdata-io copied to clipboard
Read `cell_feature_matrix` in market format
Some Xenium datasets deposited in GEO lack h5 file format and instead decided to upload market format. For example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM8253804
Of course it is possible to ask the authors to update the record. But maybe it is useful and better for other cases like this to add the ability to read the file in market format? Perhaps also then it is possible to add support for zarr in case someone decides to upload that one instead?
For now, I have tried to workaround this but the only solution was to avoid reading the AnnData entirely, read the file with scanpy and then assign it to sadata.tables. Following this code:
sdata = io.xenium(".", cells_table=False, cells_as_circles=False, cells_boundaries=False, nucleus_boundaries=False)
adata = sc.read_10x_mtx("cell_feature_matrix")
sdata.tables["mydata"] = adata
But of course this causes a lot of information to be missing and this may cause other problems. Is there any other way to work around this issue?
A more complete workaround would be something like this:
Define a function _get_tables_and_circles_mtx:
import spatialdata_io as io
from spatialdata_io._constants._constants import XeniumKeys
from spatialdata.models import (TableModel, ShapesModel)
from spatialdata.transformations.transformations import Scale
from spatialdata_io.readers.xenium import (_decode_cell_id_column, _get_polygons)
import json
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
from pathlib import Path
from typing import Any
from anndata import AnnData
import scanpy as sc
def _get_tables_and_circles_mtx(
path: Path, cells_as_circles: bool, specs: dict[str, Any]
) -> AnnData | tuple[AnnData, AnnData]:
#adata = _read_10x_h5(path / XeniumKeys.CELL_FEATURE_MATRIX_FILE)
# replace _read_10x_h5 with scanpy.read_10x_mtx
adata = sc.read_10x_mtx(path / "cell_feature_matrix")
metadata = pd.read_parquet(path / XeniumKeys.CELL_METADATA_FILE)
np.testing.assert_array_equal(metadata.cell_id.astype(str), adata.obs_names.values)
circ = metadata[[XeniumKeys.CELL_X, XeniumKeys.CELL_Y]].to_numpy()
adata.obsm["spatial"] = circ
metadata.drop([XeniumKeys.CELL_X, XeniumKeys.CELL_Y], axis=1, inplace=True)
adata.obs = metadata
adata.obs["region"] = specs["region"]
adata.obs["region"] = adata.obs["region"].astype("category")
adata.obs[XeniumKeys.CELL_ID] = _decode_cell_id_column(adata.obs[XeniumKeys.CELL_ID])
table = TableModel.parse(adata, region=specs["region"], region_key="region", instance_key=str(XeniumKeys.CELL_ID))
if cells_as_circles:
transform = Scale([1.0 / specs["pixel_size"], 1.0 / specs["pixel_size"]], axes=("x", "y"))
radii = np.sqrt(adata.obs[XeniumKeys.CELL_AREA].to_numpy() / np.pi)
circles = ShapesModel.parse(
circ,
geometry=0,
radius=radii,
transformations={"global": transform},
index=adata.obs[XeniumKeys.CELL_ID].copy(),
)
return table, circles
return table
Read the data without reading the tables. This, as above, requires setting as False several arguments.
sdata = io.xenium("data/outs", cells_table=False, cells_as_circles=False, cells_boundaries=False, nucleus_boundaries=False)
sdata
SpatialData object
├── Images
│ ├── 'morphology_focus': DataTree[cyx] (1, 37611, 48362), (1, 18805, 24181), (1, 9402, 12090), (1, 4701, 6045), (1, 2350, 3022)
│ └── 'morphology_mip': DataTree[cyx] (1, 37611, 48362), (1, 18805, 24181), (1, 9402, 12090), (1, 4701, 6045), (1, 2350, 3022)
├── Labels
│ ├── 'cell_labels': DataTree[yx] (37611, 48362), (18805, 24181), (9402, 12090), (4701, 6045), (2350, 3022)
│ └── 'nucleus_labels': DataTree[yx] (37611, 48362), (18805, 24181), (9402, 12090), (4701, 6045), (2350, 3022)
└── Points
└── 'transcripts': DataFrame with shape: (<Delayed>, 10) (3D points)
with coordinate systems:
▸ 'global', with elements:
morphology_focus (Images), morphology_mip (Images), cell_labels (Labels), nucleus_labels (Labels), transcripts (Points)
To read the AnnData table, we need to call the function _get_tables_and_circles_mtx defined above. This requires passing the specs:
f = open("data/outs/experiment.xenium")
specs = json.load(f)
specs["region"] = "cell_circles"
Now we read the AnnData:
table = _get_tables_and_circles_mtx(Path("data/outs"), False, specs)
table
AnnData object with n_obs × n_vars = 118939 × 348
obs: 'cell_id', 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'unassigned_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'region'
var: 'gene_ids', 'feature_types'
uns: 'spatialdata_attrs'
obsm: 'spatial'
And assign it to the spatialdata table element:
sdata.tables["table"] = table
sdata
SpatialData object
├── Images
│ ├── 'morphology_focus': DataTree[cyx] (1, 37611, 48362), (1, 18805, 24181), (1, 9402, 12090), (1, 4701, 6045), (1, 2350, 3022)
│ └── 'morphology_mip': DataTree[cyx] (1, 37611, 48362), (1, 18805, 24181), (1, 9402, 12090), (1, 4701, 6045), (1, 2350, 3022)
├── Labels
│ ├── 'cell_labels': DataTree[yx] (37611, 48362), (18805, 24181), (9402, 12090), (4701, 6045), (2350, 3022)
│ └── 'nucleus_labels': DataTree[yx] (37611, 48362), (18805, 24181), (9402, 12090), (4701, 6045), (2350, 3022)
├── Points
│ └── 'transcripts': DataFrame with shape: (<Delayed>, 10) (3D points)
└── Tables
└── 'table': AnnData (118939, 348)
with coordinate systems:
▸ 'global', with elements:
morphology_focus (Images), morphology_mip (Images), cell_labels (Labels), nucleus_labels (Labels), transcripts (Points)
Next we need the cell and nucleus boundaries:
polygons = {}
polygons["nucleus_boundaries"] = _get_polygons(
Path("data/outs"),
XeniumKeys.NUCLEUS_BOUNDARIES_FILE,
specs,
1,
idx=table.obs[str(XeniumKeys.CELL_ID)].copy(),
)
polygons["cell_boundaries"] = _get_polygons(
Path("data/outs"),
XeniumKeys.CELL_BOUNDARIES_FILE,
specs,
1,
idx=table.obs[str(XeniumKeys.CELL_ID)].copy(),
)
polygons
{'nucleus_boundaries': geometry
cell_id
aaaaahbc-1 POLYGON ((3293.75 2474.35, 3292.688 2474.988, ...
aaaajmpj-1 POLYGON ((3314.15 2498.15, 3312.45 2500.062, 3...
aaabmheo-1 POLYGON ((3296.725 2504.525, 3294.6 2505.587, ...
aaabnbla-1 POLYGON ((3320.1 2511.325, 3316.913 2512.175, ...
aaacegbk-1 POLYGON ((2892.55 2530.663, 2890 2532.788, 288...
... ...
oimeleho-1 POLYGON ((9272.438 4277.2, 9270.312 4278.688, ...
oimgahdm-1 POLYGON ((9276.263 4288.25, 9274.987 4289.1, 9...
oimhljno-1 POLYGON ((9272.862 4307.587, 9270.95 4308.438,...
oimhmili-1 POLYGON ((9252.888 4318.425, 9250.763 4320.55,...
oimjecob-1 POLYGON ((9242.263 4290.587, 9241.412 4291.013...
[118939 rows x 1 columns],
'cell_boundaries': geometry
cell_id
aaaaahbc-1 POLYGON ((3290.562 2467.762, 3288.438 2469.675...
aaaajmpj-1 POLYGON ((3307.988 2485.825, 3304.587 2487.738...
aaabmheo-1 POLYGON ((3302.462 2494.538, 3290.562 2502.188...
aaabnbla-1 POLYGON ((3323.712 2508.138, 3319.25 2509.413,...
aaacegbk-1 POLYGON ((2892.125 2525.138, 2890.425 2532.362...
... ...
oimeleho-1 POLYGON ((9275.838 4268.7, 9268.825 4271.888, ...
oimgahdm-1 POLYGON ((9283.7 4285.487, 9278.812 4286.125, ...
oimhljno-1 POLYGON ((9274.562 4300.362, 9265.425 4305.462...
oimhmili-1 POLYGON ((9254.588 4306.737, 9245.025 4318.212...
oimjecob-1 POLYGON ((9235.463 4280.812, 9230.362 4285.913...
[118939 rows x 1 columns]}
Which can be assigned now to the shapes element:
sdata.shapes = polygons
sdata
SpatialData object
├── Images
│ ├── 'morphology_focus': DataTree[cyx] (1, 37611, 48362), (1, 18805, 24181), (1, 9402, 12090), (1, 4701, 6045), (1, 2350, 3022)
│ └── 'morphology_mip': DataTree[cyx] (1, 37611, 48362), (1, 18805, 24181), (1, 9402, 12090), (1, 4701, 6045), (1, 2350, 3022)
├── Labels
│ ├── 'cell_labels': DataTree[yx] (37611, 48362), (18805, 24181), (9402, 12090), (4701, 6045), (2350, 3022)
│ └── 'nucleus_labels': DataTree[yx] (37611, 48362), (18805, 24181), (9402, 12090), (4701, 6045), (2350, 3022)
├── Points
│ └── 'transcripts': DataFrame with shape: (<Delayed>, 10) (3D points)
├── Shapes
│ ├── 'cell_boundaries': GeoDataFrame shape: (118939, 1) (2D shapes)
│ └── 'nucleus_boundaries': GeoDataFrame shape: (118939, 1) (2D shapes)
└── Tables
└── 'table': AnnData (118939, 348)
with coordinate systems:
▸ 'global', with elements:
morphology_focus (Images), morphology_mip (Images), cell_labels (Labels), nucleus_labels (Labels), transcripts (Points), cell_boundaries (Shapes), nucleus_boundaries (Shapes)
As a workaround probably this works. While researching this I found that spatialdata_io.xenium does not use scanpy for reading the h5 file, rather uses it's own implementation. My understanding of the reason is that this is done to check some version in the h5 but it is unclear whether this is critical or it would prevent adding support for mtx/zarr.
Thanks for reporting. I think adding support for .mtx.gz directly in xenium.py would be the cleanest approach.
I think the path for this would be to modify the xenium.py reader to read whatever matrix is available with a fallback option.
- [ ] The order would be
.h5->.mtx.gz->.zarr.zip(this last optional, I'd just put a todo in the code).
I'd use the scanpy APIs. Regarding why are not using scanpy for reading, I added more details here https://github.com/scverse/spatialdata-io/issues/288 (TLDR; we can use it).
If you could make a small PR for this it would be great please 😊
Thanks, so if I understand correctly what you suggest is that _get_tables_and_circles first tries to read h5 file, if not, it will try to use mtx and if not zarr. I guess one way would be to look for the corresponding files in the path. Right now the expected file name for h5 is recorded in XeniumKeys.CELL_FEATURE_MATRIX_FILE. Does this mean adding different keys for mtx and zarr entries? E.g., CELL_FEATURE_MATRIX_H5_FILE, CELL_FEATURE_MATRIX_MTX_FILE, CELL_FEATURE_MATRIX_ZARR_FILE. Or is your suggestion different?
Exactly I'd add the keys and then modify _get_tables_and_circles() to support also the alternative data formats.