
Executing with --slurm always runs tasks with 1 core

sjl opened this issue 2 years ago • 5 comments

Snakemake version: 7.32.3

I'm attempting to use Snakemake to execute jobs using Slurm. Everything runs as expected when running locally, but when using --slurm all of my jobs end up running with only one CPU available (even though (some of) the logs say otherwise).

I've made a minimal example that reproduces the problem reliably for me, but I'll outline it here.

I have a simple Snakefile:

rule foo:
    log:
        "logs/{sample}.log",
    input:
        infile="{sample}.in",
    output:
        outfile="{sample}.out",
    threads: 8
    resources:
        mem_mb=512,
        runtime="1m",
    shell:
        "("
        "echo Running with {threads} threads."
        " && sleep 10"
        " && wc -c {input.infile:q} > {output.outfile:q}"
        " && echo Finished."
        ") >{log} 2>&1"

Running this locally with e.g. snakemake --printshellcmds --cores 8 hello.out works fine, and the process gets the correct number of threads:

$ snakemake --printshellcmds --cores 8 hello.out
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job stats:
job      count
-----  -------
foo          1
total        1

Select jobs to execute...

[Fri Sep 15 14:51:51 2023]
rule foo:
    input: hello.in
    output: hello.out
    log: logs/hello.log
    jobid: 0
    reason: Missing output files: hello.out
    wildcards: sample=hello
    threads: 8
    resources: tmpdir=/tmp, mem_mb=512, mem_mib=489, runtime=1

(echo Running with 8 threads. && sleep 10 && wc -c hello.in > hello.out && echo Finished.) >logs/hello.log 2>&1
[Fri Sep 15 14:52:01 2023]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2023-09-15T145150.500946.snakemake.log
$ cat logs/hello.log
Running with 8 threads.
Finished.

All of that looks good. Now I want to run through slurm, to execute tasks on the cluster nodes. I made a basic profiles/slurm-example/config.yaml to specify some of the options:

default-resources:
  - slurm_account=REDACTED
  - slurm_partition=debug
  - runtime=5
  - mem_mb=1024
jobs: 10
resources:
  - mem_mb=102400
restart-times: 0
max-jobs-per-second: 10
max-status-checks-per-second: 1
latency-wait: 30
keep-going: False
rerun-incomplete: True
printshellcmds: True
scheduler: greedy

And a run.sh to invoke Snakemake with that profile:

#!/usr/bin/env bash

set -euo pipefail

snakemake --slurm --profile "profiles/slurm-example" "$@"

If I remove the output file and rerun, this time through Slurm, the task does run successfully:

$ ./run.sh hello.out
Using profile profiles/slurm-example for setting default command line arguments.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided resources: mem_mb=102400
Job stats:
job      count
-----  -------
foo          1
total        1

Select jobs to execute...

[Fri Sep 15 14:56:09 2023]
rule foo:
    input: hello.in
    output: hello.out
    log: logs/hello.log
    jobid: 0
    reason: Missing output files: hello.out
    wildcards: sample=hello
    threads: 8
    resources: mem_mb=512, mem_mib=489, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, slurm_account=REDACTED, slurm_partition=debug, runtime=1

(echo Running with 8 threads. && sleep 10 && wc -c hello.in > hello.out && echo Finished.) >logs/hello.log 2>&1
Job 0 has been submitted with SLURM jobid 2740953 (log: .snakemake/slurm_logs/rule_foo/2740953.log).
[Fri Sep 15 14:56:49 2023]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2023-09-15T145608.989687.snakemake.log

Note the "Running with 8 threads." line in the output. That sounds like exactly what I asked for. But if I check the command's output, I see that it actually only ran with one thread:

$ cat logs/hello.log
Running with 1 threads.
Finished.

Even more confusingly, if I look at the .snakemake/slurm_logs/ for that job, I see that Snakemake seems to have been invoked twice, with the outer invocation using the correct number of cores, but the inner one (which actually runs my command) only getting 1:

$ cat .snakemake/slurm_logs/rule_foo/2740953.log
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=512, mem_mib=489, disk_mb=1000, disk_mib=954
Select jobs to execute...
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=512, mem_mib=489, disk_mb=1000, disk_mib=954
Select jobs to execute...

[Fri Sep 15 14:56:19 2023]
rule foo:
    input: hello.in
    output: hello.out
    log: logs/hello.log
    jobid: 0
    reason: Missing output files: hello.out
    wildcards: sample=hello
    resources: mem_mb=512, mem_mib=489, disk_mb=1000, disk_mib=954, tmpdir=/tmp, slurm_account=REDACTED, slurm_partition=debug, runtime=1

(echo Running with 1 threads. && sleep 10 && wc -c hello.in > hello.out && echo Finished.) >logs/hello.log 2>&1
[Fri Sep 15 14:56:29 2023]
Finished job 0.
1 of 1 steps (100%) done
[Fri Sep 15 14:56:29 2023]
Finished job 0.
1 of 1 steps (100%) done

I did some poking around in Snakemake's code to try to understand what's happening. The immediate reason the second invocation only uses 1 core is that available_cpu_count in utils.py returns 1 after reading /proc/self/status, where it finds Cpus_allowed: 008000 (i.e. a single bit set in the mask, meaning one CPU).
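For reference, here is a minimal sketch (my own, not Snakemake's actual implementation) of how such a hex affinity mask maps to a CPU count: each set bit is one usable CPU, so 008000 yields 1.

def cpus_from_allowed_mask(mask: str) -> int:
    # Cpus_allowed masks can contain comma-separated 32-bit words, e.g. "ff,ffffffff";
    # dropping the commas and counting the set bits gives the number of usable CPUs.
    return bin(int(mask.replace(",", ""), 16)).count("1")

print(cpus_from_allowed_mask("008000"))  # 1 -> what the inner invocation sees
print(cpus_from_allowed_mask("ff"))      # 8 -> an unrestricted 8-core allocation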

With a little more logging, it seems like the outer invocation is being given --slurm-jobstep --jobs 1 but the inner one is not. I think this means the outer one is using SlurmJobstepExecutor, which uses srun -n1 --cpu-bind=q … to invoke the inner snakemake that actually runs my example script.

Looking at the srun documentation for --cpus-per-task (emphasis mine):

Request that ncpus be allocated per process. This may be useful if the job is multithreaded and requires more than one CPU per task for optimal performance. Explicitly requesting this option implies --exact. The default is one CPU per process and does not imply --exact.

[…]

WARNING: When srun is executed from within salloc or sbatch, there are configurations and options which can result in inconsistent allocations when -c has a value greater than -c on salloc or sbatch. The number of cpus per task specified for salloc or sbatch is not automatically inherited by srun and, if desired, must be requested again, either by specifying --cpus-per-task when calling srun, or by setting the SRUN_CPUS_PER_TASK environment variable.

So I think what is happening is:

  1. On the local machine, I execute snakemake --slurm …. This calls something like sbatch --cpus-per-task 8 for each job (presumably for each job group if I were using those) to invoke snakemake --slurm-jobstep ….
  2. On an allocated node with 8 cores, snakemake --slurm-jobstep … runs (with 8 cores available to it). It sees it has one job to run, and executes it with srun -n1 --cpu-bind=q to invoke snakemake for a single target job.
  3. This inner snakemake instance executes the job, but it only has one core available because:
    • srun was used to invoke it
    • srun doesn't inherit the --cpus-per-task from its parent sbatch
    • --cpus-per-task was not respecified (nor the env var used, as far as I can tell)
    • srun defaults to 1 CPU per task when the option is not given (see the sketch after this list)
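To make this concrete, here is a rough sketch of the chain as I understand it. Only --cpus-per-task, --slurm-jobstep --jobs 1, and srun -n1 --cpu-bind=q come from what I observed above; the trailing "..." are placeholders for arguments I haven't spelled out, not what Snakemake literally generates.

threads = 8
# Step 1: the local snakemake submits one batch job per rule instance; its payload
#         is another snakemake invocation in jobstep mode.
submit = f"sbatch --cpus-per-task={threads} ..."   # payload: snakemake --slurm-jobstep --jobs 1 ...
# Step 2: inside that 8-CPU allocation, the jobstep executor launches the rule via:
jobstep = "srun -n1 --cpu-bind=q snakemake ..."    # neither --cpus-per-task nor SRUN_CPUS_PER_TASK is set
# Step 3: srun therefore binds the step to a single CPU, and the innermost
#         snakemake scales the rule's 8 requested threads down to 1.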

I tried updating the srun line to respecify the number of CPUs per task, by changing:

call = f"srun -n1 --cpu-bind=q {self.format_job_exec(job)}"

to:

call = f"srun -n1 --cpus-per-task={job.threads} --cpu-bind=q {self.format_job_exec(job)}"

When I do that and rerun, the job uses all 8 cores as I would expect.
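The srun documentation quoted above also mentions the SRUN_CPUS_PER_TASK environment variable as an alternative to respecifying the flag, so another (untested) variant of that same line might be:

# Untested sketch: pass the thread count via the environment variable from the
# srun docs instead of the --cpus-per-task flag; this only works if the call
# string is executed through a shell, since it relies on the VAR=value prefix syntax.
call = f"SRUN_CPUS_PER_TASK={job.threads} srun -n1 --cpu-bind=q {self.format_job_exec(job)}"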

I'm very new to Snakemake and Slurm, so at this point I'm not sure whether this is indeed a bug in Snakemake or whether I'm fundamentally misunderstanding something about how snakemake --slurm is supposed to work. If it's the former and I happen to have tracked this down correctly, I'd be happy to send a PR (I see the executor has moved to a separate repo in main, but that srun line looks the same). If it's the latter, is there some documentation somewhere that might help clarify things?

Thanks!

sjl · Sep 16 '23 17:09

I am seeing the same behavior on Snakemake versions 7.32.4 with --slurm and 8.1.0 with --executor slurm (snakemake-executor-plugin-slurm 0.1.4). Jobs are scheduled with the right number of cores, but the threads value inside the rule is always 1.

paudano · Jan 10 '24 14:01

Hi everyone,

I got the same behavior on v7.25.0. As a workaround, I configured cpus_per_task using the set-resources directive in my profile file, and it worked according to scontrol show jobid -dd <jobid>. This is clearly not ideal: the number of cores set through the threads directive is still not used in a SLURM context, and I had to duplicate this piece of information (in the rules and in the SLURM profile)... but at least I can start SLURM jobs with multiple cores 😞.

For the (very nice and complete) example noted above, I believe the workaround would require editing profiles/slurm-example/config.yaml as follows (not tested):

slurm: true
default-resources:
  - slurm_account=REDACTED
  - slurm_partition=debug
  - runtime=5
  - mem_mb=1024
jobs: 10

restart-times: 0
max-jobs-per-second: 10
max-status-checks-per-second: 1
latency-wait: 30
keep-going: False
rerun-incomplete: True
printshellcmds: True
scheduler: greedy

set-resources:
  - foo:mem_mb=102400
  - foo:cpus_per_task=8

and calling snakemake via snakemake --profile "profiles/slurm-example".
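To avoid duplicating the thread count between the Snakefile and the profile, it might also work (untested, and it assumes resource callables can take threads as an argument) to derive cpus_per_task from the rule's own threads directly in the resources block of the example rule above:

    resources:
        mem_mb=512,
        runtime="1m",
        # untested: mirror the rule's threads into the Slurm cpus_per_task resource
        cpus_per_task=lambda wildcards, threads: threads,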

HTH

bdelepine · Jan 26 '24 13:01

@johanneskoester Is there someone who can take on this case? It's a real pain for pipelines using profiles to set resources.

paudano · Feb 27 '24 19:02

I start a Snakemake job with: snakemake -j 20 --profile profiles/slurm

The job outputs the command it's going to run to the screen where I ran Snakemake (note "-num_threads 16"):

blastn -db /path/to/blastdb/2.13.0/hg38_no_alt -num_threads 16 -outfmt "6 saccver sstart send qseqid qstart qend sstrand evalue mismatch gaps pident" -word_size 16 -perc_identity 95 -query temp/HG00733-h1/input/fa_part/part_15.fa | /path/to/filter_blast.py -b data/ref/filter.bed.gz -t /path/to/temp/working_temp/blast_15 -c 5000 -v | gzip > temp/HG00733-h1/blast/part/part_15.bed.gz

When I look at the log output for that job, I see this:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=32768, mem_mib=31250, disk_mb=1000, disk_mib=954
Select jobs to execute...
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=32768, mem_mib=31250, disk_mb=1000, disk_mib=954
Select jobs to execute...

[Tue Feb 27 14:28:54 2024]
rule dmap_blast_run:
    input: temp/HG00733-h1/input/fa_part/part_15.fa, data/ref/filter.bed.gz
    output: temp/HG00733-h1/blast/part/part_15.bed.gz
    jobid: 0
    reason: Missing output files: temp/HG00733-h1/blast/part/part_15.bed.gz
    wildcards: sample=HG00733-h1, part=15
    resources: mem_mb=32768, mem_mib=31250, disk_mb=1000, disk_mib=954, tmpdir=/tmp, runtime=480

blastn -db path/to/blastdb/2.13.0/hg38_no_alt -num_threads 1 -outfmt "6 saccver sstart send qseqid qstart qend sstrand evalue mismatch gaps pident" -word_size 16 -perc_identity 95 -query temp/HG00733-h1/input/fa_part/part_15.fa | /path/to/filter_blast.py -b data/ref/filter.bed.gz -t /path/to/temp/working_temp/blast_15 -c 5000 -v | gzip > temp/HG00733-h1/blast/part/part_15.bed.gz

Note that the command changed to "-num_threads 1". The start of the job log says "Provided cores: 16", then it changes to "Provided cores: 1 (use --cores to define parallelism)" a few lines later.

Am I running something incorrectly?

paudano · Feb 27 '24 19:02

See the report in the executor plugin repository.

A fix for the executor plugin is in progress. Fixing this downstream is not on the agenda. Sorry for not noticing this issue.

cmeesters · Feb 29 '24 10:02