Function to subset the entire spatialdata object
A little bit I started on this PR https://github.com/scverse/squidpy/pull/967 which introduces a function that allows users to subset their entire SpatialData objects by certain criteria. The larger goal is to emulate this Scanpy notebook and to make Squidpy basically the biologist-friendly interface to SpatialData. Subsetting your object will be the first step in that journey.
For this, there are several considerations:
- a given
SpatialDataobject can contain 0-nAnnDataobjects - these
AnnDataobjects can annotate 0-n other objects, f.e. segmentation masks, shapes (like for Visium), ROIs or even points - a given subsetting step on the AnnData object needs to find all instances that are annotated by these soon-to-be-gone observations in all other elements and deal with them accordingly:
- segmentation masks -> set to 0 (background)
- potentially: remove transcript locations falling into these segmentation masks
- shapes -> remove
- points -> remove
- etc
- segmentation masks -> set to 0 (background)
However, there are additional constraints and open questions that are important for the implementation.
- We can f.e. store segmentation masks as DataTrees with different scales - is it faster to subset the original resolution and to then regenerate the tree or subset all scales individually?
- How do we handle inplace
TruevsFalse? Returning a copy can easily mean doubling a 500 GB object.
Some other edge cases might only really show up once there.
Generally, the goal should be to identify relevant subfunctions and push these upstream to SpatialData, some might already exist there and just need to be found (realistically by asking @LucaMarconato, there's quite a few functions only he really knows about), other might need to be written and pushed upstream. Ideally Squidpy then chains together these functions into something with good UX.
We can f.e. store segmentation masks as DataTrees with different scales - is it faster to subset the original resolution and to then regenerate the tree or subset all scales individually?
I'd try benchmark this. It's all lazy anyway until things are written to disk. One important consideration is that if we want to subset and access the data in memory. In that case if we compute all the scales from the first, then we need to write it and reread it (or call .persist() from dask), otherwise the computation of the lower scales from the first scale is re-performed everytime! This would favor slicing every scale instead of computing them from the largest.
How do we handle inplace True vs False? Returning a copy can easily mean doubling a 500 GB object.
No copy is done because the heavy data (images and labels) are lazy. The copy will come only on-disk. So this should not be a problem.
Generally, the goal should be to identify relevant subfunctions and push these upstream to SpatialData, some might already exist there and just need to be found (realistically by asking @LucaMarconato, there's quite a few functions only he really knows about), other might need to be written and pushed upstream. Ideally Squidpy then chains together these functions into something with good UX.
For table-based subsetting this is the most up-to-date code we have: https://github.com/scverse/spatialdata/pull/894
Hi, I already started working on this, I have some questions and opinions.
First of all it seems reasonable to first finish https://github.com/scverse/spatialdata/pull/894 then adapt that to this function. This would avoid lots of duplicate code from what I understand. After all filtering is done with subset operations. If you agree I can help with the PR.
About this
No copy is done because the heavy data (images and labels) are lazy. The copy will come only on-disk. So this should not be a problem.
Even if it is on disk and inplace=False doesn't this mean we have to copy the whole 500gb file on disk to work on it? This comes with lots of problems like what would the data be named and where would it be stored. I feel like throwing an error when the object is on disk and inplace=False is the best thing to do
Re PR894 in SpatialData: I don't think we need to wait for that here since what he's implementing there is a lot more involved than what we're doing here (allowing to str-subset by donor or sth) whereas all functionality needed here is covered by Scanpy's filter_cells using inplace=False so we get a mask that we can then use downstream.
It's all lazy anyway until things are written to disk.
@LucaMarconato yeah, this is sometimes a bit tricky in downstream applications. I've found that I had to manually save out sdata objects after many operations because otherwise f.e. sdata-plot became useless due to insane rendering times for anything - most likely due to then materialising lazy operations again and again.