
Harmonize artificial dataset creation

Open berombau opened this issue 1 year ago • 1 comment

Is your feature request related to a problem? Please describe.
There are multiple artificial dataset creation functions. It should be clear which ones are most useful and when.

Describe the solution you'd like
Merge or document the different artificial dataset implementations. Ideally, the default one and the benchmarking one are merged, and libraries built on SpatialData can reuse some of that functionality to create more specific artificial datasets.

Additional context
Here is a list of some implementations:

  • spatialdata.datasets.blobs
    • default basic option, slow and limited in use
    • https://github.com/scverse/spatialdata/blob/main/src/spatialdata/datasets.py
  • from benchmarks.utils import make_blobs
    • https://github.com/berombau/spatialdata/blob/benchmark-asv/benchmarks/utils.py
    • very fast using adapted code from https://github.com/napari/napari/blob/195bbd0720fce1bae665cd18ccee5456a095b830/napari/benchmarks/utils.py#L175
  • SOPA blobs
    • https://github.com/gustaveroussy/sopa/blob/f1f5a99ee7f5a9489e511241a3a62bb520ec9860/sopa/utils/data.py#L188
    • more irregular cell shapes, genes from list
  • Harpy cluster_blobs
    • https://github.com/saeyslab/harpy/blob/main/src/sparrow/datasets/cluster_blobs.py
    • multisample, multichannel, ground truth cell type annotation
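
To illustrate the kind of shared core these implementations could build on, here is a minimal NumPy-only sketch of a seeded label-image generator. It is not taken from any of the repositories listed above; the function name `make_label_blobs` and its parameters (`shape`, `n_blobs`, `radius`, `seed`) are hypothetical, and the blobs are plain discs rather than the more irregular shapes some of the listed implementations produce.

```python
import numpy as np


def make_label_blobs(shape=(128, 128), n_blobs=10, radius=8, seed=0):
    """Hypothetical helper: return an integer label image containing discs.

    Background pixels are 0; the i-th disc is painted with label i (1-based).
    """
    rng = np.random.default_rng(seed)
    labels = np.zeros(shape, dtype=np.int32)

    # Random (row, col) centers, kept `radius` pixels away from the border
    # so that no disc is clipped by the image edge.
    centers = rng.integers(radius, np.array(shape) - radius, size=(n_blobs, 2))

    # Open grids broadcast to the full image without allocating coordinates
    # for every pixel up front.
    yy, xx = np.ogrid[: shape[0], : shape[1]]
    for label, (cy, cx) in enumerate(centers, start=1):
        mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius**2
        # Later discs overwrite earlier ones where they overlap.
        labels[mask] = label
    return labels


labels = make_label_blobs()
```

Because the generator is seeded, repeated calls with the same arguments return identical arrays, which is useful for both tests and benchmarks. A more specific dataset (multiple channels, gene lists, cell type annotations) could then be layered on top of such a label image.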

berombau avatar Nov 27 '24 13:11 berombau

Thanks for tracking this in an issue! I'd also add that spatialdata.datasets.blobs() is used in a lot of tests, so making it faster would lead to faster testing.

LucaMarconato avatar Nov 27 '24 14:11 LucaMarconato