Harmonize artificial dataset creation
Is your feature request related to a problem? Please describe.
There are multiple artificial dataset creation functions. It should be clear which ones are most useful and when.

Describe the solution you'd like
Merge or document the different artificial dataset implementations. Ideally, the default one and the benchmarking one would be merged, and libraries built on SpatialData could reuse some of that functionality to create more specific artificial datasets.

Additional context
Here is a list of some implementations:
- spatialdata.datasets.blobs
  - default basic option, slow and limited in use
  - https://github.com/scverse/spatialdata/blob/main/src/spatialdata/datasets.py
- from benchmarks.utils import make_blobs
  - https://github.com/berombau/spatialdata/blob/benchmark-asv/benchmarks/utils.py
  - very fast using adapted code from https://github.com/napari/napari/blob/195bbd0720fce1bae665cd18ccee5456a095b830/napari/benchmarks/utils.py#L175
- SOPA blobs
  - https://github.com/gustaveroussy/sopa/blob/f1f5a99ee7f5a9489e511241a3a62bb520ec9860/sopa/utils/data.py#L188
  - more irregular cell shapes, genes from list
- Harpy cluster_blobs
  - https://github.com/saeyslab/harpy/blob/main/src/sparrow/datasets/cluster_blobs.py
  - multisample, multichannel, ground truth cell type annotation
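
To put numbers on "slow" versus "very fast" for the implementations listed above, something like the following timing sketch could help. It is only an illustration: `time_factory` and `compare_factories` are hypothetical helper names, not part of any of these libraries, and the only library call assumed is `spatialdata.datasets.blobs()` with its default arguments. The other implementations could be registered the same way as zero-argument callables.

```python
import time
from statistics import median


def time_factory(factory, repeats=3):
    """Return the median wall-clock seconds of calling `factory()` `repeats` times."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        factory()
        durations.append(time.perf_counter() - start)
    return median(durations)


def compare_factories(factories, repeats=3):
    """Time each named factory and return (name, seconds) pairs, fastest first."""
    timings = {name: time_factory(f, repeats) for name, f in factories.items()}
    return sorted(timings.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical registry: name -> zero-argument callable that builds a dataset.
    factories = {}
    try:
        from spatialdata.datasets import blobs

        factories["spatialdata.datasets.blobs"] = blobs
    except ImportError:
        # spatialdata is not installed; nothing to time.
        pass

    for name, seconds in compare_factories(factories):
        print(f"{name}: {seconds:.3f} s")
```

The median is used rather than the mean so that a single slow first call (for example, import or cache warm-up) does not dominate the comparison.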
Thanks for tracking this in an issue! I'd also add that spatialdata.datasets.blobs() is used in a lot of tests, so making it faster would lead to faster testing.