Added `filter_by_table_query`
Added filtering by a table query, as discussed in #626. This adds both a standalone function, sd.filter_by_table_query, and a method, sd.SpatialData.filter_by_table_query.
Function signature
class SpatialData:
    ...
    def filter_by_table_query(
        self,
        table_name: str,
        filter_tables: bool = True,
        elements: list[str] | None = None,
        obs_expr: Predicates | None = None,
        var_expr: Predicates | None = None,
        x_expr: Predicates | None = None,
        obs_names_expr: Predicates | None = None,
        var_names_expr: Predicates | None = None,
        layer: str | None = None,
        how: Literal["left", "left_exclusive", "inner", "right", "right_exclusive"] = "right",
    ) -> SpatialData:
        ...
The standalone sd.filter_by_table_query takes the same arguments, except that instead of self you pass the SpatialData object of interest as the first argument.
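The signature lists five values for `how` without describing them. As a rough mental model only, and assuming they follow SQL-style join semantics between element instance ids and table rows (this is an assumption, not something stated in this PR), the options could be pictured with plain sets. The helper `join_ids` below is hypothetical and is not part of spatialdata:

```python
def join_ids(element_ids, table_ids, how="right"):
    """Return (kept_element_ids, kept_table_ids) under an ASSUMED SQL-like join.

    Hypothetical illustration only; not the spatialdata implementation.
    """
    elem, tab = set(element_ids), set(table_ids)
    both = elem & tab
    if how == "left":
        # Keep every element instance; keep only table rows that match one.
        return elem, both
    if how == "left_exclusive":
        # Keep only element instances that have no table row.
        return elem - tab, set()
    if how == "inner":
        # Keep only what appears on both sides.
        return both, both
    if how == "right":
        # Keep every table row; keep only element instances that match one.
        return both, tab
    if how == "right_exclusive":
        # Keep only table rows that have no element instance.
        return set(), tab - elem
    raise ValueError(f"unknown join type: {how!r}")


print(join_ids([1, 2, 3], [2, 3, 4], how="right"))
```

Under this reading, the default `how="right"` would mean that the filtered table decides which element instances survive.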
What expressions can you use?
- Several methods are supported by narwhals, as long as the method doesn't aggregate.
  - I know that the following work: `>`, `>=`, `<`, `<=`, `==`, `is_in`
  - And from Expr.str, `contains`, `starts_with`, and `ends_with` work.
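To make the "doesn't aggregate" condition concrete, here is a minimal plain-Python sketch (no narwhals involved, and the numbers are made up). An elementwise comparison yields one boolean per row and can therefore act as a row mask, whereas an aggregation collapses the column to a single value and cannot select rows:

```python
# Made-up values standing in for an obs column such as "cell_size".
cell_size = [350, 420, 510, 390]

# Elementwise predicate: one boolean per row, usable as a filter mask.
mask = [size > 400 for size in cell_size]

# Aggregation: a single value for the whole column, so it cannot pick rows.
mean_size = sum(cell_size) / len(cell_size)

kept = [size for size, keep in zip(cell_size, mask) if keep]
print(mask, kept, mean_size)
```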
What parts can you filter on?
You can filter on the obs and var DataFrame attributes of AnnData.
You can filter on obs_names and var_names (use an.obs_names and an.var_names instead of an.col).
You can also filter on the expression matrix X, optionally with respect to a layer (see the layer argument).
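The sketch below illustrates, with toy lists instead of a real AnnData, how these kinds of predicates relate to the table: obs-style and X-style predicates give one boolean per row (cell), and var_names-style predicates give one boolean per column (marker). All names and values are made up, and it assumes every mask is evaluated against the unfiltered table, which may differ from the actual implementation:

```python
# Toy stand-ins for AnnData fields; all values are invented for illustration.
obs_names = ["cell_1", "cell_2", "cell_3"]
var_names = ["ASCT2", "CD4", "CD8"]
donor = ["21d7", "90de", "21d7"]  # a single obs column
X = [
    [0.05, 1.0, 0.0],
    [0.50, 0.2, 0.3],
    [1.50, 0.0, 0.9],
]

# obs-style predicate: one boolean per row (cell).
rows_by_obs = [d == "21d7" for d in donor]

# X-style predicate: also one boolean per row, read from one column of X.
asct2 = var_names.index("ASCT2")
rows_by_x = [0.1 < row[asct2] < 2 for row in X]

# var_names-style predicate: one boolean per column (marker).
cols = [name.startswith("CD") for name in var_names]

# A row is kept only if every row-level mask keeps it.
rows = [a and b for a, b in zip(rows_by_obs, rows_by_x)]

kept_obs = [n for n, keep in zip(obs_names, rows) if keep]
kept_var = [n for n, keep in zip(var_names, cols) if keep]
kept_X = [
    [value for value, keep_col in zip(row, cols) if keep_col]
    for row, keep_row in zip(X, rows)
    if keep_row
]
print(kept_obs, kept_var, kept_X)
```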
Some Examples
# Using the mibitof dataset because it's small and has a table that covers multiple SpatialData elements.
import spatialdata as sd
import annsel as an
from upath import UPath
mibitof_path = UPath("~/Downloads/mibitof-dataset.zarr")
sdata = sd.read_zarr(mibitof_path)
sdata
SpatialData Repr
SpatialData object, with associated Zarr store: /Users/srivarra/Downloads/mibitof-dataset.zarr
├── Images
│ ├── 'point8_image': DataArray[cyx] (3, 1024, 1024)
│ ├── 'point16_image': DataArray[cyx] (3, 1024, 1024)
│ └── 'point23_image': DataArray[cyx] (3, 1024, 1024)
├── Labels
│ ├── 'point8_labels': DataArray[yx] (1024, 1024)
│ ├── 'point16_labels': DataArray[yx] (1024, 1024)
│ └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
└── 'table': AnnData (3309, 36)
with coordinate systems:
▸ 'point8', with elements:
point8_image (Images), point8_labels (Labels)
▸ 'point16', with elements:
point16_image (Images), point16_labels (Labels)
▸ 'point23', with elements:
point23_image (Images), point23_labels (Labels)
For context, here is what the table looks like:
AnnData object with n_obs × n_vars = 3309 × 36
obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
uns: 'spatialdata_attrs'
obsm: 'X_scanorama', 'X_umap', 'spatial'
- Filter with respect to the donor "21d7", and filter `var_names` to keep "ASCT2", "ATP5A", and any marker that starts with "CD".
sd.filter_by_table_query(
    sdata,
    table_name="table",
    obs_expr=an.col("donor") == "21d7",
    var_names_expr=(
        an.var_names.is_in(["ASCT2", "ATP5A"])
        | an.var_names.str.starts_with("CD")
    ),
    x_expr=None,
)
Output
SpatialData object
├── Labels
│ └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
└── 'table': AnnData (1241, 14)
with coordinate systems:
▸ 'point23', with elements:
point23_labels (Labels)
- Filter by batches "0" and "1".
sdata.filter_by_table_query(
    table_name="table",
    obs_expr=an.col("batch").is_in(["1", "0"]),
)
Output
SpatialData object
├── Labels
│ ├── 'point8_labels': DataArray[yx] (1024, 1024)
│ └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
└── 'table': AnnData (2286, 36)
with coordinate systems:
▸ 'point8', with elements:
point8_labels (Labels)
▸ 'point23', with elements:
point23_labels (Labels)
- Filter by `obs_names` which start with "9".
sd.filter_by_table_query(
    sdata,
    table_name="table",
    obs_names_expr=an.obs_names.str.starts_with("9"),
)
Output
SpatialData object
├── Labels
│ └── 'point8_labels': DataArray[yx] (1024, 1024)
└── Tables
└── 'table': AnnData (624, 36)
with coordinate systems:
▸ 'point8', with elements:
point8_labels (Labels)
- Note that a tuple of expressions is combined with the `&` operator.
sd.filter_by_table_query(
    sdata,
    table_name="table",
    var_names_expr=(an.var_names.str.contains("CD"), an.var_names == "CD8"),
    x_expr=None,
)
Output
SpatialData object
├── Labels
│ ├── 'point8_labels': DataArray[yx] (1024, 1024)
│ ├── 'point16_labels': DataArray[yx] (1024, 1024)
│ └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
└── 'table': AnnData (3309, 1)
with coordinate systems:
▸ 'point8', with elements:
point8_labels (Labels)
▸ 'point16', with elements:
point16_labels (Labels)
▸ 'point23', with elements:
point23_labels (Labels)
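The tuple behaviour in this example can be sketched in plain Python: each predicate in the tuple must hold for a name to be kept, which is why only one variable survives. The marker list here is invented for illustration and is not the real mibitof `var_names`:

```python
# Invented marker names; only the logic matters here.
var_names = ["CD4", "CD8", "CD45", "ASCT2"]

# A tuple of predicates, mirroring (contains("CD"), == "CD8").
predicates = (
    lambda name: "CD" in name,
    lambda name: name == "CD8",
)

# A name is kept only if ALL predicates hold, i.e. they are AND-ed together.
kept = [name for name in var_names if all(p(name) for p in predicates)]
print(kept)  # ['CD8']
```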
- A complex query.
sd.filter_by_table_query(
    sdata,
    elements=["point23_labels", "point8_labels"],
    table_name="table",
    # Filter observations (rows) based on multiple conditions
    obs_expr=(
        # Cells from donor 21d7 OR 90de
        an.col("donor").is_in(["21d7", "90de"])
        # AND cells with size greater than 400
        & (an.col("cell_size") > 400)
        # AND cells in the Epithelial cluster
        & (an.col("Cluster") == "Epithelial")
        # OR any cell whose cluster name contains "Tcell"; `&` binds tighter than `|`,
        # so this branch is not restricted by the donor and size conditions above
        | (an.col("Cluster").str.contains("Tcell"))
    ),
    # Filter variables (columns) based on multiple conditions
    var_names_expr=(
        # Select columns that start with CD
        an.var_names.str.starts_with("CD")
        # OR columns that contain "ATP"
        | an.var_names.str.contains("ATP")
        # OR specific columns
        | an.var_names.is_in(["ASCT2", "PKM2", "SMA"])
    ),
    # Filter based on expression values
    x_expr=(
        # Keep cells where ASCT2 is greater than 0.1
        (an.col("ASCT2") > 0.1)
        # AND less than 2 for ASCT2
        & (an.col("ASCT2") < 2)
    ),
    how="right",
)
Output
SpatialData object
├── Labels
│ ├── 'point8_labels': DataArray[yx] (1024, 1024)
│ └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
└── 'table': AnnData (268, 17)
with coordinate systems:
▸ 'point8', with elements:
point8_labels (Labels)
▸ 'point23', with elements:
point23_labels (Labels)
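One detail worth keeping in mind when reading the `obs_expr` of the complex query: Python evaluates `&` before `|`, so a chain like `A & B & C | D` groups as `(A & B & C) | D`. Since precedence is a property of the language syntax, this applies to expression objects just as it does to booleans. A minimal sketch with hypothetical per-cell flags:

```python
# Hypothetical flags for one cell, standing in for the four obs conditions.
donor_ok = False    # donor is neither "21d7" nor "90de"
big_enough = True   # cell_size > 400
epithelial = False  # Cluster == "Epithelial"
tcell = True        # Cluster contains "Tcell"

# Without extra parentheses: `&` is evaluated first, giving (A & B & C) | D.
as_written = donor_ok & big_enough & epithelial | tcell

# With explicit grouping: A & B & (C | D).
grouped = donor_ok & big_enough & (epithelial | tcell)

print(as_written, grouped)  # True False
```

If the intent is "donor AND size AND (Epithelial OR Tcell)", the two cluster conditions need their own pair of parentheses.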
Other things to note:
I added a more complex SpatialData for testing in conftest.py. I do not know if this should be there or somewhere else, or if I should make better use of what's there currently.
- Any thoughts or suggestions?
- Is this a feature which requires a tutorial notebook or additions to an already existing one?
Notebook: Table Queries
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 92.09%. Comparing base (60be9ce) to head (1677e35).
Additional details and impacted files
@@ Coverage Diff @@
## main #894 +/- ##
==========================================
- Coverage 92.10% 92.09% -0.01%
==========================================
Files 48 48
Lines 7433 7442 +9
==========================================
+ Hits 6846 6854 +8
- Misses 587 588 +1
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/spatialdata/__init__.py | 96.42% <ø> (ø) | |
| src/spatialdata/_core/query/relational_query.py | 91.23% <100.00%> (+0.09%) | :arrow_up: |
| src/spatialdata/_core/spatialdata.py | 91.49% <100.00%> (+0.03%) | :arrow_up: |
Hey @srivarra, thanks for the PR! Sorry for not getting to it earlier; the last few months have been a bit hectic with the PhD, but I'm checking it now. We might incorporate a tutorial notebook, given that people could be new to this way of writing queries, but in any case I am in favor of this kind of syntax.
@melonora No worries, hope the PhD is going well! Sounds good, I'll draft up a tutorial notebook and request a review when it's done.
Awesome!
@melonora Would the proper process be:
- Make an issue in scverse/spatialdata-notebooks
- Make a PR there for the notebook
- ??? Find a way to link it to this branch? A bit confused on this part.
tyty
@srivarra Just open a PR in scverse/spatialdata-notebooks :) As a title, you can give it "table queries".