spatialdata
User documentation on how to set up jobs or run large analyses
**Is your feature request related to a problem? Please describe.**
The current documentation shows small examples. Working with real, large datasets differs in several ways and has specific needs:
- limiting the number of Dask workers to bound memory usage (see the sketch after this list)
- running batch jobs instead of interactive sessions
- sometimes targeting a specific HPC setup
- requesting resources via SLURM, or using workflow managers such as Nextflow or Snakemake
- working with a distributed Dask cluster
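
As a rough illustration of the first point, a `LocalCluster` can be created with explicit worker and memory limits before running any SpatialData computation. This is only a minimal sketch based on the Dask `LocalCluster` docs linked below; the worker count and memory limit are placeholder values to tune per machine.

```python
# Minimal sketch: cap the number of workers and per-worker memory.
# All numbers are placeholders; adjust to the machine at hand.
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=4,            # at most 4 worker processes
        threads_per_worker=1,   # one thread per worker
        memory_limit="8GB",     # per-worker memory cap
    )
    client = Client(cluster)
    try:
        # ... run the SpatialData computation here (e.g. segmentation) ...
        pass
    finally:
        client.close()
        cluster.close()
```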
**Describe the solution you'd like**
A documentation page should explain this and link to existing resources. It would also be useful to gather existing documentation on executing large jobs with SpatialData.
Some resources:
- Dask cluster:
  - https://docs.dask.org/en/stable/deploying-python.html?#localcluster
  - https://docs.dask.org/en/latest/scheduling.html
  - https://docs.dask.org/en/latest/deploying-hpc.html
  - https://docs.dask.org/en/latest/deploying.html#advanced-understanding
  - https://jobqueue.dask.org/en/latest/ (see the SLURM sketch below)
- Developing with Python environments on HPC: https://docs.hpc.ugent.be/Linux/setting_up_python_virtual_environments/?h=venv
- SpatialData workflows on HPC:
  - Hydra: https://harpy.readthedocs.io/en/latest/tutorials/hpc/index.html
  - Nextflow:
    - https://github.com/LucaMarconato/spatialdata-mcmicro
    - https://nf-co.re/configs/vsc_ugent
  - Snakemake: https://gustaveroussy.github.io/sopa/tutorials/snakemake/
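
For the distributed-cluster case, dask-jobqueue (linked above) lets a Dask cluster request its workers as SLURM jobs. A hedged sketch, assuming dask-jobqueue is installed; the partition name, resource sizes, and walltime are placeholders:

```python
# Sketch of a Dask cluster whose workers are submitted as SLURM jobs
# via dask-jobqueue (https://jobqueue.dask.org/). All values are placeholders.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="batch",        # placeholder SLURM partition
    cores=8,              # cores per SLURM job
    memory="32GB",        # memory per SLURM job
    walltime="02:00:00",  # walltime per SLURM job
)
cluster.scale(jobs=4)     # ask SLURM for 4 worker jobs
client = Client(cluster)
# ... run the SpatialData computation here ...
```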
Some new notebook ideas:
- [ ] intermediate notebook on limiting workers when segmenting with map apply: https://docs.dask.org/en/stable/deploying-python.html?#localcluster
- [ ] advanced notebook on setting up a distributed Dask cluster: https://jobqueue.dask.org/en/latest/runners-overview.html
- [ ] working with the Dask dashboard
- [ ] working with Dask spans for fine performance metrics (both sketched below)
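
For the last two ideas, a hedged sketch of what such notebooks could demonstrate: printing the dashboard URL and tagging a computation with a span so that its fine performance metrics are grouped on the dashboard. The `span` context manager ships with recent `distributed` releases; the span name below is a placeholder.

```python
# Sketch: locate the Dask dashboard and group metrics under a named span.
# Requires a recent distributed release that provides `span`.
from distributed import Client, span

client = Client()                 # starts a LocalCluster by default
print(client.dashboard_link)      # open this URL in a browser

with span("segmentation"):        # placeholder span name
    # ... run the Dask-backed SpatialData computation here; its fine
    # performance metrics appear under this span on the dashboard ...
    pass
```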