spatialdata Added `filter_table_by_query`

Added filtering by a table query as discussed in #626. Added both a standalone function, sd.filter_by_table_query, and a method, sd.SpatialData.filter_by_table_query.

Function signature

class SpatialData:
    ...

    def filter_by_table_query(
        self,
        table_name: str,
        filter_tables: bool = True,
        elements: list[str] | None = None,
        obs_expr: Predicates | None = None,
        var_expr: Predicates | None = None,
        x_expr: Predicates | None = None,
        obs_names_expr: Predicates | None = None,
        var_names_expr: Predicates | None = None,
        layer: str | None = None,
        how: Literal["left", "left_exclusive", "inner", "right", "right_exclusive"] = "right",
    ) -> SpatialData:

The standalone sd.filter_by_table_query takes the same arguments, except that instead of self you pass the SpatialData object of interest as the first argument.

What expressions can you use?

Several narwhals expression methods are supported, as long as the method doesn't aggregate.
- I know that the following work: >, >=, <, <=, ==, is_in.
- From Expr.str: contains, starts_with, and ends_with work.
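The reason aggregating methods don't fit is that a filter needs one boolean per row (or per variable). A rough pure-Python illustration of the difference, with made-up values; this is not how narwhals or annsel evaluate expressions internally:

```python
# Made-up values, one per row (purely illustrative).
cell_size = [120, 450, 800, 390]

# Element-wise comparison: one boolean per row, usable as a filter mask.
mask = [size > 400 for size in cell_size]

# Aggregation: collapses the column to a single value, so there is
# nothing to line up against the individual rows.
mean_size = sum(cell_size) / len(cell_size)

# Apply the mask to keep only the matching rows.
kept = [size for size, keep in zip(cell_size, mask) if keep]
print(mask)  # [False, True, True, False]
print(kept)  # [450, 800]
```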

What parts can you filter on?

You can filter on the obs and var DataFrame attributes of AnnData.

You can filter on obs_names and var_names (use an.obs_names and an.var_names instead of an.col).

You can filter on the expression matrix X, or on a specific layer via the layer argument.

Some Examples

# Using the mibitof dataset because it's small and has a table that covers multiple spatialdata elements.

import spatialdata as sd
import annsel as an
from upath import UPath

mibitof_path = UPath("~/Downloads/mibitof-dataset.zarr")

sdata = sd.read_zarr(mibitof_path)

sdata

SpatialData Repr

SpatialData object, with associated Zarr store: /Users/srivarra/Downloads/mibitof-dataset.zarr
├── Images
│     ├── 'point8_image': DataArray[cyx] (3, 1024, 1024)
│     ├── 'point16_image': DataArray[cyx] (3, 1024, 1024)
│     └── 'point23_image': DataArray[cyx] (3, 1024, 1024)
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     ├── 'point16_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (3309, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_image (Images), point8_labels (Labels)
    ▸ 'point16', with elements:
        point16_image (Images), point16_labels (Labels)
    ▸ 'point23', with elements:
        point23_image (Images), point23_labels (Labels)

For context here is what the table looks like:

AnnData object with n_obs × n_vars = 3309 × 36
    obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
    uns: 'spatialdata_attrs'
    obsm: 'X_scanorama', 'X_umap', 'spatial'

Filter with respect to the donor "21d7", and keep only the var_names "ASCT2", "ATP5A", and any marker that starts with "CD".

sd.filter_by_table_query(
    sdata,
    table_name="table",
    obs_expr=an.col("donor") == "21d7",
    var_names_expr=(
        an.var_names.is_in(["ASCT2", "ATP5A"])
        | an.var_names.str.starts_with("CD")
    ),
    x_expr=None,
)

Output

SpatialData object
├── Labels
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (1241, 14)
with coordinate systems:
    ▸ 'point23', with elements:
        point23_labels (Labels)

Filter by batches "0" and "1".


sdata.filter_by_table_query(
    table_name="table",
    obs_expr=an.col("batch").is_in(["1", "0"]),
)

Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (2286, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)

Filter by obs_names that start with "9".

sd.filter_by_table_query(
    sdata,
    table_name="table",
    obs_names_expr=an.obs_names.str.starts_with("9")
)

Output

SpatialData object
├── Labels
│     └── 'point8_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (624, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)

Note that a tuple of expressions is combined with the & operator, so every expression in the tuple must hold.


sd.filter_by_table_query(
    sdata,
    table_name="table",
    var_names_expr=(an.var_names.str.contains("CD"), an.var_names == "CD8"),
    x_expr=None,
)

Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     ├── 'point16_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (3309, 1)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point16', with elements:
        point16_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)
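The tuple behaviour above can be pictured as an element-wise AND over the individual conditions. A minimal pure-Python sketch of that semantics, not the library's implementation; the marker names and the helper `combine_with_and` are made up for illustration:

```python
from collections.abc import Callable, Iterable

Predicate = Callable[[str], bool]


def combine_with_and(predicates: Iterable[Predicate]) -> Predicate:
    """Return a predicate that is true only when every given predicate is true."""
    preds = list(predicates)
    return lambda name: all(p(name) for p in preds)


# Hypothetical marker names, for illustration only.
var_names = ["CD3", "CD8", "CD45", "ASCT2"]

# Mirrors the tuple (contains "CD", equals "CD8") from the example above.
keep = combine_with_and((lambda n: "CD" in n, lambda n: n == "CD8"))
selected = [n for n in var_names if keep(n)]
print(selected)  # ['CD8']
```

Only "CD8" satisfies both conditions, which matches the single remaining variable in the output above.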

Complex query.

sd.filter_by_table_query(
    sdata,
    elements=["point23_labels", "point8_labels"],
    table_name="table",
    # Filter observations (rows) based on multiple conditions
    obs_expr=(
        # Cells from donor 21d7 OR 90de
        an.col("donor").is_in(["21d7", "90de"])
        # AND cells with size greater than 400
        & (an.col("cell_size") > 400)
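        # AND cells in the Epithelial cluster
        & (an.col("Cluster") == "Epithelial")
        # OR any cell whose cluster name contains "Tcell"; & binds tighter than |,
        # so this branch is not restricted by the donor or size conditions
        | (an.col("Cluster").str.contains("Tcell"))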
    ),
    # Filter variables (columns) based on multiple conditions
    var_names_expr=(
        # Select columns that start with CD
        an.var_names.str.starts_with("CD")
        # OR columns that contain "ATP"
        | an.var_names.str.contains("ATP")
        # OR specific columns
        | an.var_names.is_in(["ASCT2", "PKM2", "SMA"])
    ),
    # Filter based on expression values
    x_expr=(
        # Keep cells where ASCT2 is greater than 0.1
        (an.col("ASCT2") > 0.1)
        # AND less than 2 for ASCT2
        & (an.col("ASCT2") < 2)
    ),
    how="right",
)

Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (268, 17)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)
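One thing to keep in mind when chaining conditions as in the obs_expr of the complex query: in Python, & binds more tightly than |, so a & b & c | d is evaluated as (a & b & c) | d. A small sketch with plain sets standing in for the rows each condition selects (the cell ids are made up):

```python
# Hypothetical cell ids standing in for the rows each condition selects.
donor_ok = {1, 2, 3, 4}
size_ok = {2, 3, 4, 5}
epithelial = {3, 6}
tcell = {4, 7}

# Without extra parentheses, & is applied before |.
implicit = donor_ok & size_ok & epithelial | tcell

# The same grouping written out explicitly.
explicit_and_first = (donor_ok & size_ok & epithelial) | tcell

# The grouping needed for "donor AND size AND (Epithelial OR Tcell)".
or_grouped = donor_ok & size_ok & (epithelial | tcell)

print(sorted(implicit))    # [3, 4, 7]
print(sorted(or_grouped))  # [3, 4]
```

If the intent is "donor AND size AND (Epithelial OR Tcell)", the OR needs its own pair of parentheses; the resulting selection may then differ from the output shown above.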

Other things to note:

I added a more complex SpatialData for testing in conftest.py. I do not know if this should be there or somewhere else, or if I should make better use of what's there currently.

Any thoughts or suggestions?
Is this a feature which requires a tutorial notebook or additions to an already existing one?

Notebook: Table Queries

Mar 03 '25 07:03 srivarra

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.09%. Comparing base (60be9ce) to head (1677e35).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #894      +/-   ##
==========================================
- Coverage   92.10%   92.09%   -0.01%     
==========================================
  Files          48       48              
  Lines        7433     7442       +9     
==========================================
+ Hits         6846     6854       +8     
- Misses        587      588       +1

| Files with missing lines | Coverage Δ |
|---|---|
| src/spatialdata/__init__.py | `96.42% <ø> (ø)` |
| src/spatialdata/_core/query/relational_query.py | `91.23% <100.00%> (+0.09%)` :arrow_up: |
| src/spatialdata/_core/spatialdata.py | `91.49% <100.00%> (+0.03%)` :arrow_up: |

... and 1 file with indirect coverage changes

Mar 03 '25 07:03 codecov[bot]

Hey @srivarra, thanks for the PR! Sorry for not getting to it earlier, last months have been a bit hectic with the PhD, but checking it now. Might incorporate a tutorial notebook given that people could be new to this way of writing queries, but in any case I am in favor of this kind of syntax.

May 27 '25 14:05 melonora

@melonora No worries, hope the PhD is going well! Sounds good I'll draft up a tutorial notebook and request a review when it's done.

May 27 '25 17:05 srivarra

Awesome!

May 27 '25 20:05 melonora

@melonora Would the proper process be:

1. Make an issue in scverse/spatialdata-notebooks
2. Make a PR there for the notebook
3. ??? Find a way to link it to this branch? A bit confused on this part.

tyty

May 28 '25 21:05 srivarra

@srivarra Just open a PR in scverse/spatialdata-notebooks:) As a title you can give it table queries

Jun 01 '25 12:06 melonora