PySlurm ignoring some batch job options
Details
- Slurm Version: 19.05.0
- Python Version: 2.7
- Cython Version: 0.29.15
- PySlurm Branch: 19-05-0
- Linux Distribution: Ubuntu 16.04
Issue
PySlurm seems to ignore some valid "sbatch" parameters when submitting a batch job. Example python code:
```python
import pyslurm

job_opts = {"wrap": "sleep 60",
            "ntasks": 1,
            "cpus_per_task": 8,
            "gres": "gpu:turing:1"}

pyslurm.job().submit_batch_job(job_opts)
```
Equivalent "sbatch" command line call:
```
sbatch --wrap="sleep 60" --ntasks=1 --cpus-per-task=8 --gres="gpu:turing:1"
```
The sbatch command line call behaves as expected (allocates 8 cores and 1 Turing GPU), but the PySlurm code seems to ignore some of the parameters, "cpus_per_task" and "gres" in this example, and only allocates the default 2 cores with no GPU. I've tested a few other parameters (e.g. "job_name", "partition", etc.) and they appear to work correctly, so this seems limited to certain arguments.
I do not see any errors or warnings in any of the slurm logs when run with either method, and no exceptions are thrown when run through PySlurm.
After looking through the pyslurm.pyx file (and the "fill_job_desc_from_opts" function in particular), it seems like the "gres" parameter may not be supported. However, "cpus_per_task" does appear to be supported, yet it still does not work.
Any help with this would be greatly appreciated. Also, if I'm right that "gres" is not supported, are there any workarounds or alternative methods for allocating GPUs to batch jobs?
Thanks, Charlie
I think the issue or bug is on L2679: https://github.com/PySlurm/pyslurm/blob/c50467cd84b8b2dfaed45298ae7d81043dae009d/pyslurm/pyslurm.pyx#L2675-L2683
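If I'm reading it right, that block maps the user-facing cpus_per_task option onto desc.min_cpus, roughly like this (a paraphrase for readers without the repo open, not copied verbatim from pyslurm.pyx; see the linked lines for the exact code):

```python
# Paraphrase of the suspected mapping in fill_job_desc_from_opts();
# the actual code at the link above may differ in detail.
if "cpus_per_task" in job_opts:
    desc.min_cpus = job_opts["cpus_per_task"]
```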
I think this might be a carry-over from a previous version that no longer works in 19.05. If you could help me track down what it should be, we should be able to fix it.
I don't see any obvious problems with the code snippet you posted, and the job_desc_msg_t datatype defined in both the PySlurm and Slurm repositories still contains the desc.min_cpus field. I'm not very experienced with Slurm, and I did not dive deep enough to find out whether this min_cpus field is actually used by the Slurm API code. One thing I did notice is that a cpus_per_task field is also defined in the job_desc_msg_t structure, but it doesn't seem to be used by PySlurm. Instead of translating this argument to min_cpus, could this field be set directly with:
```python
desc.cpus_per_task = job_opts.get("cpus_per_task", 1)
```
In the meantime, I'm getting around this by translating the dictionary of job options into a command line call and invoking sbatch with the subprocess library. This is not ideal, but it is sufficient for my uses. Example code is below for anyone else having a similar issue.
```python
from future.utils import iteritems
import subprocess


def submit_job(job_info):
    # Construct sbatch command
    slurm_cmd = ["sbatch"]
    for key, value in iteritems(job_info):
        # Check for special case keys
        if key == "cpus_per_task":
            key = "cpus-per-task"
        elif key == "job_name":
            key = "job-name"
        elif key == "script":
            # Script path goes last, after all the options
            continue
        slurm_cmd.append("--%s=%s" % (key, value))
    slurm_cmd.append(job_info["script"])
    print("Generated slurm batch command: '%s'" % slurm_cmd)

    # Run sbatch command as subprocess.
    try:
        sbatch_output = subprocess.check_output(slurm_cmd)
    except subprocess.CalledProcessError as e:
        # Print error message from sbatch for easier debugging, then re-raise.
        # e.output holds the command's captured stdout.
        if e.output is not None:
            print("ERROR: Subprocess call output: %s" % e.output)
        raise

    # Parse job id from sbatch output ("Submitted batch job <id>").
    # decode() keeps this working on both Python 2 (str) and Python 3 (bytes).
    for s in sbatch_output.decode("utf-8").strip().split():
        if s.isdigit():
            return int(s)
```
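For example, a hypothetical call (the script name and resource values are placeholders, not from a real cluster). Note that this path also covers the "gres" question above, since sbatch accepts --gres directly:

```python
# Hypothetical example values -- substitute your own script and resources.
job_info = {"script": "my_batch_script.sh",
            "job_name": "gpu_test",
            "cpus_per_task": 8,
            "gres": "gpu:turing:1"}

job_id = submit_job(job_info)
print("Submitted batch job %s" % job_id)
```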
I am also having trouble submitting a job with anything other than 1 CPU core:

```python
mem = 32000
cpus = 4
partition = 'mypartition'
job_name = "sweet_job_name"

awesome_job_opts = {
    'script': sweet_script_name,
    'realmem': mem,
    'cpus-per-task': cpus,
    'partition': partition,
    'job_name': job_name,
}

pyslurm.job().submit_batch_job(awesome_job_opts)
```
This results in a job that has 32,000 MB but only 1 CPU core.
```
$ squeue -o "jobid: %A name: %j cpus: %C ram:%m %P %t %M %o" | grep sweet_job_name
jobid: 12345 name: sweet_job_name cpus: 1 ram:32000M mypartition R 1-02:51:10 (null)
```
AHHH! I figured it out from @cahartsell's comment above. It should be:

```python
'cpus_per_task': cpus,
```

Underscores work, dashes don't!
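For completeness, here is the corrected options dict from my earlier comment (same placeholder values as before):

```python
awesome_job_opts = {
    'script': sweet_script_name,
    'realmem': mem,
    'cpus_per_task': cpus,   # underscores, not dashes
    'partition': partition,
    'job_name': job_name,
}

pyslurm.job().submit_batch_job(awesome_job_opts)
```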