spatialdata
User documentation on how to set up jobs or run large analyses
**Is your feature request related to a problem? Please describe.**
The current documentation shows small examples. Working with real, large datasets differs in several ways and has specific needs:
- limiting the number of Dask workers to bound memory usage (see the sketch after this list)
- running batch jobs instead of interactive sessions
- sometimes targeting a specific HPC setup
- requesting resources via SLURM, or using workflow managers such as Nextflow or Snakemake
- working with a distributed Dask cluster
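
As a rough illustration of the first point, a `LocalCluster` can be created with explicit worker and memory limits before running any SpatialData computation. This is only a minimal sketch based on the Dask `LocalCluster` docs linked below; the worker count and memory limit are placeholder values to tune per machine.

```python
# Minimal sketch: cap the number of workers and per-worker memory.
# All numbers are placeholders; adjust to the machine at hand.
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=4,            # at most 4 worker processes
        threads_per_worker=1,   # one thread per worker
        memory_limit="8GB",     # per-worker memory cap
    )
    client = Client(cluster)
    try:
        # ... run the SpatialData computation here (e.g. segmentation) ...
        pass
    finally:
        client.close()
        cluster.close()
```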
**Describe the solution you'd like**
A documentation page should explain this and link to existing resources. It would also be useful to gather existing documentation on executing large jobs with SpatialData.
Some resources:
- Dask cluster:
  - https://docs.dask.org/en/stable/deploying-python.html?#localcluster
  - https://docs.dask.org/en/latest/scheduling.html
  - https://docs.dask.org/en/latest/deploying-hpc.html
  - https://docs.dask.org/en/latest/deploying.html#advanced-understanding
  - https://jobqueue.dask.org/en/latest/ (see the SLURM sketch below)
- Developing with Python environments on HPC: https://docs.hpc.ugent.be/Linux/setting_up_python_virtual_environments/?h=venv
- SpatialData workflows on HPC:
  - Hydra: https://harpy.readthedocs.io/en/latest/tutorials/hpc/index.html
  - Nextflow:
    - https://github.com/LucaMarconato/spatialdata-mcmicro
    - https://nf-co.re/configs/vsc_ugent
  - Snakemake: https://gustaveroussy.github.io/sopa/tutorials/snakemake/
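
For the distributed-cluster case, dask-jobqueue (linked above) lets a Dask cluster request its workers as SLURM jobs. A hedged sketch, assuming dask-jobqueue is installed; the partition name, resource sizes, and walltime are placeholders:

```python
# Sketch of a Dask cluster whose workers are submitted as SLURM jobs
# via dask-jobqueue (https://jobqueue.dask.org/). All values are placeholders.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="batch",        # placeholder SLURM partition
    cores=8,              # cores per SLURM job
    memory="32GB",        # memory per SLURM job
    walltime="02:00:00",  # walltime per SLURM job
)
cluster.scale(jobs=4)     # ask SLURM for 4 worker jobs
client = Client(cluster)
# ... run the SpatialData computation here ...
```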
Some new notebook ideas:
- [ ] intermediate notebook on limiting workers when segmenting with map apply: https://docs.dask.org/en/stable/deploying-python.html?#localcluster
- [ ] advanced notebook on setting up a distributed Dask cluster: https://jobqueue.dask.org/en/latest/runners-overview.html
- [ ] working with the Dask dashboard
- [ ] working with Dask spans for fine performance metrics (both sketched below)
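
For the last two ideas, a hedged sketch of what such notebooks could demonstrate: printing the dashboard URL and tagging a computation with a span so that its fine performance metrics are grouped on the dashboard. The `span` context manager ships with recent `distributed` releases; the span name below is a placeholder.

```python
# Sketch: locate the Dask dashboard and group metrics under a named span.
# Requires a recent distributed release that provides `span`.
from distributed import Client, span

client = Client()                 # starts a LocalCluster by default
print(client.dashboard_link)      # open this URL in a browser

with span("segmentation"):        # placeholder span name
    # ... run the Dask-backed SpatialData computation here; its fine
    # performance metrics appear under this span on the dashboard ...
    pass
```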