spatialdata
Feedback on concatenate()
While I was eventually able to concatenate the data the way I wanted, the user experience wasn't as smooth as I had hoped, so I wanted to share some feedback. As I'm not that familiar with spatialdata yet, there may already be better solutions; please let me know if there are.
Starting situation
I have ~20 Visium Cytassist samples from a clinical trial processed with nf-core/spatialtranscriptomics (using the https://github.com/nf-core/spatialtranscriptomics/pull/67 branch that already uses spatialdata). The pipeline generates a single `.zarr` folder for each sample.
Desired outcome
I would like to have all samples in a single SpatialData object. The AnnData table should contain the gene expression from all samples.
Pain points
- `sd.concatenate` enforces that the input is a list. Is there a reason this can't accept any `Sequence` type (e.g. `dict_values`)?
- Usually, I pass a dictionary `sample_id -> AnnData` to `anndata.concat`, which nicely makes unique obs_names in combination with `concat(..., index_unique="_")`. This doesn't work with `spatialdata.concatenate`, which leaves me with either manipulating the `obs_names` for each object before concatenation, or ugly obs names with numeric suffixes (e.g. `AACTCAACCTTGACCA-1_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0`). IMO it would be great to support a dict as input to `spatialdata.concatenate`, too.
- The per-sample SpatialData objects all have the same names for images, shapes and coordinate systems. I currently rename them like this:
```python
import spatialdata as sd
from tqdm import tqdm

# `samplesheet` (iterated row by row, one row per sample) and `sample_path`
# (the root folder of the pipeline output) are defined elsewhere.
sdatas_vis = {}
for _, row in tqdm(samplesheet.iterrows(), total=samplesheet.shape[0]):
    sample = row["sample"]
    tmp_sd = sd.read_zarr(sample_path / sample / "data" / "sdata_processed.zarr")
    # add the samplesheet columns to obs and point the table at the renamed shapes
    tmp_sd.tables["table"].obs = tmp_sd.tables["table"].obs.assign(**row)
    tmp_sd.tables["table"].obs["region"] = sample
    tmp_sd.tables["table"].uns["spatialdata_attrs"]["region"] = sample
    # rename images
    tmp_sd.images[f"{sample}_hires"] = tmp_sd.images["visium_hires_image"]
    tmp_sd.images[f"{sample}_lowres"] = tmp_sd.images["visium_lowres_image"]
    del tmp_sd.images["visium_hires_image"]
    del tmp_sd.images["visium_lowres_image"]
    # rename shapes
    tmp_sd.shapes[f"{sample}"] = tmp_sd.shapes["visium"]
    del tmp_sd.shapes["visium"]
    sdatas_vis[sample] = tmp_sd
```
which seems a bit cumbersome. I'm wondering if there's a better solution or what's the intended way of handling such cases. It could also be worth adding a process to the nf-core/spatialtranscriptomics pipeline that already does the concatenation step.
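The repeated assign-then-delete pairs in the snippet above could be folded into a small helper. The following is only a sketch: `rename_elements` is a hypothetical name, not part of spatialdata, and it assumes the element containers behave like ordinary mutable mappings (which the item assignment and `del` in the snippet suggest). A plain dict stands in for `tmp_sd.images` so the example runs without any data.

```python
from collections.abc import MutableMapping
from typing import Dict


def rename_elements(elements: MutableMapping, renames: Dict[str, str]) -> None:
    """Rename keys of a dict-like container in place (hypothetical helper)."""
    for old_name, new_name in renames.items():
        if old_name == new_name:
            continue  # nothing to do
        if new_name in elements:
            raise KeyError(f"{new_name!r} already exists")
        # pop + assign replaces the assign + del pair used per element above
        elements[new_name] = elements.pop(old_name)


# Stand-in for tmp_sd.images; the values are placeholders, not real images.
images = {"visium_hires_image": "hires", "visium_lowres_image": "lowres"}
sample = "sample_A"
rename_elements(
    images,
    {
        "visium_hires_image": f"{sample}_hires",
        "visium_lowres_image": f"{sample}_lowres",
    },
)
print(sorted(images))
```

Whether `pop` works on the real containers exactly as it does on a dict has not been checked here, so treat this as a starting point.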
I am a bit swamped at the moment, but I will look into implementing your suggestions. As you said, it would be worthwhile to handle dicts.
I have the same issue!
The per-sample SpatialData objects all have the same names for images, shapes and coordinate systems. So when I concatenate them, a KeyError occurs: `KeyError: 'Images must have unique names across the SpatialData objects to concatenate'`
It would also be better to have a way to retrieve (subset) each object from the concatenated object.
@wangjiawen2013 does `SpatialData.subset()` work for your use case, or would you improve something?
What I mean is how to concatenate multiple spatialdata objects and then subset each object from the concatenated object again according to the sample names (each object has a unique name). The SpatialData objects from Xenium all have the same names for images, shapes and coordinate systems, so I cannot concatenate them because of `KeyError: Images must have unique names across the SpatialData objects to concatenate`.
`SpatialData.subset()` can only get elements, not objects.
We can concatenate and subset anndata objects well; what I mean is to concatenate and subset spatialdata objects like anndata objects.
Hi, getting back to this today.
> What I mean is how to concatenate multiple spatialdata objects and then subset each object from the concatenated object again according to the sample names (each object has a unique name). The SpatialData objects from Xenium all have the same names for images, shapes and coordinate systems, so I cannot concatenate them because of `KeyError: Images must have unique names across the SpatialData objects to concatenate`. `SpatialData.subset()` can only get elements, not objects. We can concatenate and subset anndata objects well; what I mean is to concatenate and subset spatialdata objects like anndata objects.
What I suggest here is, for each sample, to map all its geometry to a coordinate system called "sample_XXX", with XXX being the name/id of the sample. In this way you can easily get the `SpatialData` object for that sample using `sdata.filter_by_coordinate_system()`. Also `subset()` (with `filter_table=True`, which is the default) should work. You can see an example of both strategies in action in this notebook (3 samples; 1 image and 1 labels per sample; 1 global table).
Please let me know if it works for you.
@grst
> The per-sample SpatialData objects all have the same names for images, shapes and coordinate systems. I currently rename them like this:
> ...
> which seems a bit cumbersome. I'm wondering if there's a better solution or what's the intended way of handling such cases. It could also be worth adding a process to the nf-core/spatialtranscriptomics pipeline that already does the concatenation step.

@wangjiawen2013
> I have the same issue! The per-sample SpatialData objects all have the same names for images, shapes and coordinate systems. So when I concatenate them, a KeyError occurs: `KeyError: 'Images must have unique names across the SpatialData objects to concatenate'`

What you implemented is basically how I would currently do it. One idea would be to wrap that into an official helper function, or to have the `concatenate()` function do this automatically (contributions are appreciated!). In the long term, though, our approach would be to allow the user to have nested NGFF hierarchies: https://github.com/scverse/spatialdata/issues/398. This would solve the problem with unique names, because the new element name would be its path relative to the NGFF store root (`sample0/image` would be different from `sample1/image`).
> sd.concatenate enforces that the input is a list. Is there a reason this can't accept any Sequence type (e.g. dict_values)?

- [x] I'll fix this. I'll support `Iterable` so that `my_dict.values()` will work.
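As a rough illustration of what accepting any `Iterable` means for callers, here is a plain-Python sketch. `as_list` is a hypothetical helper, not the actual spatialdata implementation; it only shows that a `dict_values` view or a generator can be materialized into a list, while a bare string is rejected.

```python
from collections.abc import Iterable
from typing import Any, List


def as_list(sdatas: Iterable) -> List[Any]:
    """Materialize any iterable of objects into a list (hypothetical helper)."""
    # A str is iterable too, but it is almost certainly a caller mistake here.
    if isinstance(sdatas, (str, bytes)) or not isinstance(sdatas, Iterable):
        raise TypeError(f"expected an iterable of objects, got {type(sdatas).__name__}")
    return list(sdatas)


# Strings stand in for the per-sample SpatialData objects.
per_sample = {"sample_A": "sdata_A", "sample_B": "sdata_B"}
from_view = as_list(per_sample.values())  # dict_values view
from_generator = as_list(s for s in per_sample.values())  # generator
print(from_view, from_generator)
```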
> Usually, I pass a dictionary sample_id -> AnnData to anndata.concat, which nicely makes unique obs_names in combination with concat(..., index_unique="_"). This doesn't work with spatialdata.concatenate, which leaves me with either manipulating the obs_names for each object before concatenation, or ugly obs names with numeric suffixes (e.g. AACTCAACCTTGACCA-1_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0). IMO it would be great to support a dict as input to spatialdata.concatenate, too.

- [x] I will try exploring a solution for this.
One comment on this, though. In `spatialdata` we don't use the `.obs` names, for a couple of reasons:
- what maps the table rows to the geometries is the pair (region, instance_id); the `obs` index alone is not enough, so we don't use it. What we use are the columns named after the `region_key` and `instance_key` values.
- we needed the `obs` indices to be integers, but `AnnData` only supported strings.
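To make the first point concrete, here is a small plain-Python illustration. It does not use spatialdata; the tuples are made-up rows, and `region` / `instance_id` stand for whatever columns `region_key` and `instance_key` name in a real table. The instance ids repeat across samples, so only the pair identifies a geometry.

```python
from typing import Dict, List, Tuple

# Made-up table rows: (obs_name, region, instance_id).
rows: List[Tuple[str, str, int]] = [
    ("0", "sample_A", 1),
    ("1", "sample_A", 2),
    ("2", "sample_B", 1),  # same instance_id as the first row, different region
]


def build_lookup(table_rows: List[Tuple[str, str, int]]) -> Dict[Tuple[str, int], str]:
    """Map each (region, instance_id) pair to the obs name of its row."""
    lookup: Dict[Tuple[str, int], str] = {}
    for obs_name, region, instance_id in table_rows:
        key = (region, instance_id)
        if key in lookup:
            raise ValueError(f"duplicate (region, instance_id) pair: {key}")
        lookup[key] = obs_name
    return lookup


lookup = build_lookup(rows)
instance_ids = [instance_id for _, _, instance_id in rows]
# instance_id alone is ambiguous across samples; the pair is not.
print(len(set(instance_ids)), len(lookup))
```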
So renaming the `obs` names would be a convenience (or a way to guarantee that they are unique), but they would not be used in any other `spatialdata` API, only in downstream calls to `anndata`/`scanpy` APIs. So I wonder if we should proceed as follows:
- [x] if the user passes a `dict`, we don't use the dict itself (we just use its values), and then we pass the dict to `ad.concat`. This ensures unique names.
- [x] if the user doesn't pass a `dict`, we simply call `.obs_names_make_unique()` on the resulting tables. If this is not done, calls to APIs such as `join_spatialelement_table()` fail.
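The two branches above lead to different obs names. The sketch below models them in plain Python and calls neither `anndata` nor `spatialdata`. Both function names are hypothetical: `suffix_with_keys` mimics appending the dict key with a separator, as `index_unique="_"` does, and `make_unique` is a simplified stand-in for de-duplicating names by appending a counter.

```python
from typing import Dict, List


def suffix_with_keys(names_per_sample: Dict[str, List[str]], sep: str = "_") -> List[str]:
    """Dict branch: append each sample key to the obs names of that sample."""
    return [f"{name}{sep}{key}" for key, names in names_per_sample.items() for name in names]


def make_unique(names: List[str], join: str = "-") -> List[str]:
    """No-dict branch: keep first occurrences, number the later duplicates."""
    counts: Dict[str, int] = {}
    taken = set(names)
    result: List[str] = []
    for name in names:
        n = counts.get(name, 0)
        if n == 0:
            counts[name] = 1
            result.append(name)
            continue
        candidate = f"{name}{join}{n}"
        while candidate in taken:  # avoid clashing with an existing name
            n += 1
            candidate = f"{name}{join}{n}"
        taken.add(candidate)
        counts[name] = n + 1
        result.append(candidate)
    return result


# The same barcode appears in both samples.
barcodes = {"sample_A": ["AACT-1", "GGTA-1"], "sample_B": ["AACT-1"]}
with_keys = suffix_with_keys(barcodes)
flat = [name for names in barcodes.values() for name in names]
without_keys = make_unique(flat)
print(with_keys)
print(without_keys)
```

In the dict branch the sample of origin stays readable in every name, whereas the fallback only guarantees uniqueness.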
Ok actually I have implemented all the above. @wangjiawen2013 @grst it would be great if you could try this out please 😊
Good news! I'll try later.