spatialdata aggregate 3D support + relatively slow on large images
**Is your feature request related to a problem? Please describe.**
`spatialdata.aggregate` does not support 3D. `spatialdata.aggregate` can be slow on large images. `spatialdata.aggregate` does not allow aggregation of a points layer by a labels layer.
**Describe the solution you'd like**
There is currently no 3D support in `spatialdata.aggregate`, e.g. aggregation of two raster elements:
```python
import os

import dask.array as da
import spatialdata
from spatialdata import aggregate, read_zarr
from spatialdata.datasets import blobs

path = "."  # any writable directory

sdata = blobs(length=2000)
sdata.write(os.path.join(path, "sdata_blobs.zarr"))
sdata = read_zarr(sdata.path)

# build a 3D image by stacking the 2D blobs image along z -> (c, z, y, x)
arr = sdata["blobs_image"].data
arr = da.stack([arr, arr]).transpose(1, 0, 2, 3)
se = spatialdata.models.Image3DModel.parse(arr)
sdata["blobs_image_3d"] = se
if sdata.is_backed():
    sdata.write_element("blobs_image_3d")
    sdata = read_zarr(sdata.path)

# build a matching 3D labels layer -> (z, y, x)
arr = sdata["blobs_labels"].data
arr = da.stack([arr, arr])
se = spatialdata.models.Labels3DModel.parse(arr)
sdata["blobs_labels_3d"] = se
if sdata.is_backed():
    sdata.write_element("blobs_labels_3d")
    sdata = read_zarr(sdata.path)

adata = aggregate(
    values=sdata["blobs_image_3d"],
    by=sdata["blobs_labels_3d"],
)
adata["table"].to_df()  # -> raises NotImplementedError
```
`aggregate` can also be slow on large images:
```python
import dask.array as da
import spatialdata
from spatialdata import aggregate, read_zarr

# tile the blobs image/labels 20x20 to get a large example
se = spatialdata.models.Image2DModel.parse(
    da.tile(sdata["blobs_image"].data, (1, 20, 20)).rechunk(4000),
)
sdata["blobs_image_large"] = se
if sdata.is_backed():
    sdata.write_element("blobs_image_large")
    sdata = read_zarr(sdata.path)

se = spatialdata.models.Labels2DModel.parse(
    da.tile(sdata["blobs_labels"].data, (20, 20)).rechunk(4000),
)
sdata["blobs_labels_large"] = se
if sdata.is_backed():
    sdata.write_element("blobs_labels_large")
    sdata = read_zarr(sdata.path)

adata = aggregate(
    values=sdata["blobs_image_large"],
    by=sdata["blobs_labels_large"],
)
adata["table"].to_df()  # --> takes around 6 min on a Mac M2, 32 GB RAM
```
While adding 3D support to the current implementation based on `xrspatial.zonal_stats` would be fairly straightforward (i.e. just iterate over the z-stacks), I think it makes sense to consider a custom implementation for aggregation of raster data that does not rely on `xrspatial.zonal_stats` and that is faster. For example, the solution here for aggregating an image layer with a labels layer ('sum'), https://github.com/saeyslab/harpy/blob/4f557433168f5b4cf8b776c884214b3254e9e6d2/src/sparrow/table/_allocation_intensity.py#L251 , is both much faster (30 s on the same hardware) and supports 3D.
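To illustrate the idea (this is just a minimal numpy sketch, not the harpy implementation; `sum_intensity_per_label` is a hypothetical helper name): summing intensities per label reduces to a grouped sum over label ids, which `np.bincount` does in a single pass. Because only `ravel` is used, the same code works unchanged for 3D `(c, z, y, x)` input.

```python
import numpy as np

def sum_intensity_per_label(image: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # image: (c, *spatial), labels: (*spatial) integer mask; returns (n_labels, c).
    # np.bincount sums the flattened image values grouped by label id.
    n_labels = int(labels.max()) + 1
    flat_labels = labels.ravel()
    return np.stack(
        [np.bincount(flat_labels, weights=ch.ravel(), minlength=n_labels) for ch in image]
    ).T

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=(64, 64))  # 4 label ids, 0 = background
image = rng.random((3, 64, 64))  # 3 channels
sums = sum_intensity_per_label(image, labels)
print(sums.shape)  # (4, 3): one row per label id, one column per channel
```

For dask-backed data the same reduction could be mapped over chunks and summed, which is presumably where the speedup over `xrspatial.zonal_stats` comes from.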
`aggregate` also does not support aggregating a points layer by a labels layer:
```python
adata = aggregate(
    values=sdata["blobs_points"],
    by=sdata["blobs_labels"],
)
adata["table"].to_df()  # -> raises NotImplementedError
```
This means we currently have to convert a labels layer to a shapes layer before we can aggregate. Also, because shapes layers have limited 3D support, supporting aggregation directly on a labels layer makes sense.
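Aggregating points by labels directly can be sketched by looking up the label value under each point and counting per label id; the following is a minimal numpy/pandas illustration (not the spatialdata or harpy API; `count_points_per_label` is a hypothetical helper), which avoids rasterizing or vectorizing the labels:

```python
import numpy as np
import pandas as pd

def count_points_per_label(labels: np.ndarray, points: pd.DataFrame) -> pd.Series:
    # Look up the label id under each point (points outside the image are dropped),
    # then count points per label id; label 0 is background.
    y = points["y"].to_numpy().astype(int)
    x = points["x"].to_numpy().astype(int)
    inside = (y >= 0) & (y < labels.shape[0]) & (x >= 0) & (x < labels.shape[1])
    ids = labels[y[inside], x[inside]]
    return pd.Series(ids).value_counts().sort_index()

labels = np.zeros((10, 10), dtype=int)
labels[2:5, 2:5] = 1  # one 3x3 cell with label id 1
points = pd.DataFrame({"y": [3, 3, 8], "x": [3, 4, 8]})
counts = count_points_per_label(labels, points)
print(counts)  # label 0 -> 1 point, label 1 -> 2 points
```

Counting per gene instead of total points would be a `groupby` over (label id, gene), and the lookup extends naturally to 3D by adding a z index.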
I understand that full 3D support is probably not the main priority, but to future-proof some features of spatialdata I think it is worthwhile to already think about it.
For aggregating points on labels I've got this workaround: https://github.com/BioinfoTongLI/Image-ST/blob/main/bin/to_spatialdata.py#L111-L118 It is, however, not very memory-efficient.
Hi @BioinfoTongLI, for aggregation of points on labels, I have this memory-efficient implementation: https://github.com/saeyslab/harpy/blob/f563bcda27ac0138df94224840496a0d2680008b/src/sparrow/table/_allocation.py#L27
For aggregation of images and labels, we now have this implementation: https://github.com/saeyslab/harpy/blob/f563bcda27ac0138df94224840496a0d2680008b/src/sparrow/utils/_aggregate.py#L16 which supports essentially all the aggregations that `xrspatial.zonal_stats` and `dask_image` implement, but much faster.