
Workflow not being executed in parallel on batch computing nodes (SLURM cluster grouped execution)

gipert opened this issue 2 years ago · 5 comments

I'm trying to write a profile to run this workflow on NERSC's supercomputer. Batch computing nodes have 128 CPU cores (x2 hyperthreads) and 512 GB of memory. Submission is managed through SLURM and maximum wall time is 12h.

My workflow is mostly composed of a large number of ~1h long, single-threaded jobs. I would like to instruct Snakemake to pack them efficiently and submit a much smaller number of jobs to SLURM. The jobs running on a node should use all available resources and run in parallel.

This is what I've written so far:

configfile: config.json
keep-going: true
quiet: rules

# profit from Perlmutter's scratch area: https://docs.nersc.gov/filesystems/perlmutter-scratch
# NOTE: should actually set this through the command line, since there is a
# scratch directory for each user and variable expansion does not work here:
#   $ snakemake --shadow-prefix "$PSCRATCH" [...]
# shadow-prefix: "$PSCRATCH"

# NERSC uses the SLURM job scheduler
# - https://snakemake.readthedocs.io/en/stable/executing/cluster.html#executing-on-slurm-clusters
slurm: true

# maximum number of cores requested from the cluster or cloud scheduler
cores: 256
# maximum number of cores used locally, on the interactive node
local-cores: 256
# maximum number of jobs that can exist in the SLURM queue at a time
jobs: 50

# reasonable defaults that do not stress the scheduler
max-jobs-per-second: 20
max-status-checks-per-second: 20

# (LEGEND) NERSC-specific settings
# - https://snakemake.readthedocs.io/en/stable/executing/cluster.html#advanced-resource-specifications
# - https://docs.nersc.gov/jobs
default-resources:
  - slurm_account="m2676"
  - constraint="cpu"
  - runtime=120
  - mem_mb=500
  - slurm_extra="--qos regular --licenses scratch,cfs"

# number of threads used by each rule
set-threads:
  - tier_ver=1
  - tier_raw=1

# memory and runtime requirements for each single rule
# - https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources
# - https://docs.nersc.gov/jobs/#available-memory-for-applications-on-compute-nodes
set-resources:
  - tier_ver:mem_mb=500
  - tier_ver:runtime=120
  - tier_raw:mem_mb=500
  - tier_raw:runtime=120

# we define groups in order to let Snakemake group rule instances in the same
# SLURM job. relevant docs:
# - https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-grouping
# - https://snakemake.readthedocs.io/en/stable/executing/grouping.html#job-grouping
groups:
  - tier_ver=sims
  - tier_raw=sims

# disconnected parts of the workflow can run in parallel (at most 256 of them)
# in a group
group-components:
    - sims=256

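For context, the same grouping can also be declared directly in the Snakefile via the `group` directive (the rule inputs, outputs, and shell commands below are hypothetical placeholders, not the actual workflow):

```python
# Sketch of a Snakefile fragment assigning both rules to the "sims" group,
# equivalent to the profile's `groups:` section. Paths and commands are made up.
rule tier_ver:
    input:
        "macros/{simid}.mac",
    output:
        "generated/tier/ver/{simid}.root",
    group:
        "sims"  # same effect as `- tier_ver=sims` in the profile
    threads: 1
    shell:
        "simulate {input} {output}"

rule tier_raw:
    input:
        "generated/tier/ver/{simid}.root",
    output:
        "generated/tier/raw/{simid}.root",
    group:
        "sims"
    threads: 1
    shell:
        "process {input} {output}"
```

Together with `group-components: sims=256`, Snakemake should then merge up to 256 connected components of the `sims` group into a single SLURM job.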
And this is the relevant part of Snakemake's output:

> snakemake --profile workflow/profiles/nersc-batch --verbose
sbatch call: sbatch --job-name 02d1132e-27d6-4d5c-aed4-3a88e1d30e93 -o .snakemake/slurm_logs/group_sims/%j.log --export=ALL -A m2676 -t 120 -C cpu --mem 20000 --cpus-per-task=40 --qos regular --licenses scratch,cfs -D /global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1 --wrap='/global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/tools/snakemake-mambaforge3/envs/snakemake/bin/python3.11 -m snakemake --snakefile '"'"'/global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/workflow/Snakefile'"'"' --target-jobs [ELIDED] --allowed-rules [tier_raw ... ELIDED ... tier_raw] --local-groupid '"'"'eaff24b4-a825-52a7-9d08-aac77f1f7b10'"'"' --cores '"'"'all'"'"' --attempt 1 --resources '"'"'mem_mb=20000'"'"' '"'"'disk_mib=38160'"'"' '"'"'disk_mb=40000'"'"' '"'"'mem_mib=19080'"'"' --wait-for-files-file '"'"'/global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/.snakemake/tmp.gg2t0bhd/snakejob_sims_eaff24b4-a825-52a7-9d08-aac77f1f7b10.waitforfilesfile.txt'"'"' --force --keep-target-files --keep-remote --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --rerun-triggers '"'"'input'"'"' '"'"'code'"'"' '"'"'mtime'"'"' '"'"'params'"'"' '"'"'software-env'"'"' --skip-script-cleanup  --shadow-prefix '"'"'/pscratch/sd/p/pertoldi'"'"' --conda-frontend '"'"'mamba'"'"' --wrapper-prefix '"'"'https://github.com/snakemake/snakemake-wrappers/raw/'"'"' --configfiles '"'"'/global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/config.json'"'"' --latency-wait 5 --scheduler '"'"'greedy'"'"' --scheduler-solver-path '"'"'/global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/tools/snakemake-mambaforge3/envs/snakemake/bin'"'"' --set-resources '"'"'tier_ver:mem_mb=500'"'"' '"'"'tier_ver:runtime=120'"'"' '"'"'tier_raw:mem_mb=500'"'"' '"'"'tier_raw:runtime=120'"'"' --default-resources '"'"'mem_mb=500'"'"' '"'"'disk_mb=max(2*input.size_mb, 1000)'"'"' '"'"'tmpdir=system_tmpdir'"'"' 
'"'"'slurm_account="m2676"'"'"' '"'"'constraint="cpu"'"'"' '"'"'runtime=120'"'"' '"'"'slurm_extra="--qos regular --licenses scratch,cfs"'"'"'  --slurm-jobstep --jobs 1 --mode 2'
Job eaff24b4-a825-52a7-9d08-aac77f1f7b10 has been submitted with SLURM jobid 10939730 (log: .snakemake/slurm_logs/group_sims/10939730.log).

And this is the content of that log file:

> cat .snakemake/slurm_logs/group_sims/10939730.log
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 256
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=20000, disk_mib=38160, disk_mb=40000, mem_mib=19080
Select jobs to execute...
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=500, mem_mib=477, disk_mb=1000, disk_mib=954
Select jobs to execute...

[Sat Jul  1 09:28:58 2023]
Job 0: Producing output file for job 'raw.l200a-wls-reflector-Rn222-to-Po214.0'
Reason: Missing output files: /global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/generated/tier/raw/l200a-fibers-Rn222-to-Po214/l200a-fibers-Rn222-to-Po214_0000.root

Changing to shadow directory: /pscratch/sd/p/pertoldi/shadow/tmpb727ncaf
Write-protecting output file /global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/generated/tier/raw/l200a-fibers-Rn222-to-Po214/l200a-fibers-Rn222-to-Po214_0000.root.
[Sat Jul  1 09:32:15 2023]
Finished job 0.
1 of 1 steps (100%) done
Write-protecting output file /global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/generated/tier/raw/l200a-fibers-Rn222-to-Po214/l200a-fibers-Rn222-to-Po214_0000.root.
[Sat Jul  1 09:32:16 2023]
Finished job 23.
1 of 40 steps (2%) done
Select jobs to execute...
srun: Job 10939730 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for StepId=10939730.1
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=500, mem_mib=477, disk_mb=1000, disk_mib=954
Select jobs to execute...

[Sat Jul  1 09:33:15 2023]
Job 0: Producing output file for job 'raw.l200a-wls-reflector-Rn222-to-Po214.0'
[...]

As you can see, jobs are executed serially on the node, even though they are independent of each other.

What's wrong with my profile?

gipert avatar Jul 01 '23 16:07 gipert

Update: removing the --slurm-jobstep flag at the end of the Snakemake command executed on the batch node seems to fix the issue. That option takes care of prepending the right srun call:

https://github.com/snakemake/snakemake/blob/bad91152eeb70693e1459324f738a8c481378801/snakemake/executors/slurm/slurm_jobstep.py#L106

but why does this produce a serial execution?
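For what it's worth, the `Requested nodes are busy` line in the log above hints at SLURM's default step-resource behaviour: a step started with a plain srun inside a job allocation can claim the job's entire memory allocation, so further steps are blocked until it finishes. A minimal sketch outside Snakemake (the allocation sizes and sleep commands are made up; `--exact` is only available in recent SLURM versions):

```sh
# Inside, e.g., a 4-CPU, 2000 MB job allocation: without per-step limits,
# the first step may claim all of the job's memory, so the second waits:
srun -n1 sleep 60 &
srun -n1 sleep 60 &   # "step creation temporarily disabled, retrying"
wait

# Giving each step an explicit share lets the steps run concurrently:
srun -n1 --exact --mem=500 sleep 60 &
srun -n1 --exact --mem=500 sleep 60 &
wait
```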

gipert avatar Jul 02 '23 11:07 gipert

Seems like I'm experiencing the same issue reported here: https://github.com/snakemake/snakemake/issues/2060

gipert avatar Jul 02 '23 17:07 gipert

Sorry for looking into this issue so late. Since Snakemake v8, the SLURM executor code lives in its own repo.

Does the issue persist for you after updating?

cmeesters avatar May 06 '24 09:05 cmeesters

I need to check again. Is this https://github.com/snakemake/snakemake-executor-plugin-slurm/issues/29 resolved?

gipert avatar May 06 '24 09:05 gipert

I had a similar case: v7.32.3 had the same problem, while v8.23.2 works as expected.

pachi avatar Oct 18 '24 15:10 pachi