Harmonize artificial dataset creation

Open berombau opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please describe.
There are multiple artificial dataset creation functions. It should be clear which ones are most useful and when.

Describe the solution you'd like
Merge or document the different artificial dataset implementations. Ideally, the default one and the benchmarking one are merged and the ones from libraries using SpatialData can reuse some functionality to make more specific artificial datasets.

Additional context
Here is a list of some implementations:

- spatialdata.datasets.blobs
  - default basic option, slow and limited in use
  - https://github.com/scverse/spatialdata/blob/main/src/spatialdata/datasets.py
- from benchmarks.utils import make_blobs
  - https://github.com/berombau/spatialdata/blob/benchmark-asv/benchmarks/utils.py
  - very fast using adapted code from https://github.com/napari/napari/blob/195bbd0720fce1bae665cd18ccee5456a095b830/napari/benchmarks/utils.py#L175
- SOPA blobs
  - https://github.com/gustaveroussy/sopa/blob/f1f5a99ee7f5a9489e511241a3a62bb520ec9860/sopa/utils/data.py#L188
  - more irregular cell shapes, genes from list
- Harpy cluster_blobs
  - https://github.com/saeyslab/harpy/blob/main/src/sparrow/datasets/cluster_blobs.py
  - multisample, multichannel, ground truth cell type annotation
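
To make the discussion concrete, here is a minimal NumPy-only sketch of the kind of shared building block a merged implementation might expose: a function that paints labeled disks into an integer image. It is not the code from spatialdata, the benchmark branch, napari, SOPA, or Harpy; the name `make_label_blobs` and all of its parameters are hypothetical and exist only for illustration.

```python
import numpy as np


def make_label_blobs(shape=(256, 256), n_blobs=20, radius=10, seed=0):
    """Hypothetical helper: paint filled disks with unique integer labels."""
    rng = np.random.default_rng(seed)
    labels = np.zeros(shape, dtype=np.uint32)
    # Random disk centers, one (row, col) pair per blob.
    centers = np.stack([rng.integers(0, s, size=n_blobs) for s in shape], axis=1)
    # Boolean disk footprint, computed once and reused for every blob.
    offsets = np.arange(-radius, radius + 1)
    disk = offsets[:, None] ** 2 + offsets[None, :] ** 2 <= radius**2
    for label, (r, c) in enumerate(centers, start=1):
        # Clip the footprint window to the image bounds.
        r0, r1 = max(r - radius, 0), min(r + radius + 1, shape[0])
        c0, c1 = max(c - radius, 0), min(c + radius + 1, shape[1])
        window = disk[r0 - r + radius : r1 - r + radius,
                      c0 - c + radius : c1 - c + radius]
        # Later blobs overwrite earlier ones where they overlap.
        labels[r0:r1, c0:c1][window] = label
    return labels


labels = make_label_blobs()
```

More specific generators (irregular shapes, multiple channels, cell type annotations) could in principle wrap a function like this, though whether that split suits the existing implementations is an open question for this issue.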

Nov 27 '24 13:11 berombau

Thanks for tracking this in an issue! I'd also add that spatialdata.datasets.blobs() is used in a lot of tests, so making it faster would lead to faster testing.

Nov 27 '24 14:11 LucaMarconato