PySlurm ignoring some batch job options
Details
- Slurm Version: 19.05.0
- Python Version: 2.7
- Cython Version: 0.29.15
- PySlurm Branch: 19-05-0
- Linux Distribution: Ubuntu 16.04
Issue
PySlurm seems to ignore some valid "sbatch" parameters when submitting a batch job. Example python code:
```python
import pyslurm

job_opts = {"wrap": "sleep 60",
            "ntasks": 1,
            "cpus_per_task": 8,
            "gres": "gpu:turing:1"}

pyslurm.job().submit_batch_job(job_opts)
```
Equivalent "sbatch" command line call:
```
sbatch --wrap="sleep 60" --ntasks=1 --cpus-per-task=8 --gres="gpu:turing:1"
```
The sbatch command line call behaves as expected (allocates 8 cores and 1 Turing GPU), but the PySlurm code seems to ignore some of the parameters, "cpus_per_task" and "gres" in this example, and only allocates the default 2 cores with no GPU. I've tested a few other parameters (e.g. "job_name", "partition", etc.) and they appear to work correctly, so this seems limited to certain arguments.
I do not see any errors or warnings in any of the slurm logs when run with either method, and no exceptions are thrown when run through PySlurm.
After looking through the pyslurm.pyx file (and the "fill_job_desc_from_opts" function in particular), it seems like the "gres" parameter may not be supported. However, "cpus_per_task" does appear to be supported, yet it still does not work.
Any help with this would be greatly appreciated. Also, if I'm right that "gres" is not supported, are there any workarounds or alternative methods for allocating GPUs to batch jobs?
Thanks, Charlie
I think the issue or bug is on L2679: https://github.com/PySlurm/pyslurm/blob/c50467cd84b8b2dfaed45298ae7d81043dae009d/pyslurm/pyslurm.pyx#L2675-L2683
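If I'm reading it right, that block maps the user-facing cpus_per_task option onto desc.min_cpus, roughly like this (a paraphrase for readers without the repo open, not copied verbatim from pyslurm.pyx; see the linked lines for the exact code):

```python
# Paraphrase of the suspected mapping in fill_job_desc_from_opts();
# the actual code at the link above may differ in detail.
if "cpus_per_task" in job_opts:
    desc.min_cpus = job_opts["cpus_per_task"]
```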
I think this might be a carry-over from a previous version that no longer works in 19.05. If you could help me track down what it should be, we should be able to fix it.
I don't see any obvious problems with the code snippet you posted, and the job_desc_msg_t datatype defined in both the PySlurm and Slurm repositories still contains the desc.min_cpus field. I'm not very experienced with Slurm, and I did not dive deep enough to find out whether this min_cpus field is actually used by the Slurm API code. One thing I did notice is that a cpus_per_task field is also defined in the job_desc_msg_t structure, but it doesn't seem to be used by PySlurm. Instead of translating this argument to min_cpus, could this field be set directly with:
```python
desc.cpus_per_task = job_opts.get("cpus_per_task", 1)
```
In the meantime, I'm getting around this by translating the dictionary of job options into a command line call and invoking sbatch with the subprocess library. This is not ideal, but it is sufficient for my uses. Example code is below for anyone else having a similar issue.
```python
from future.utils import iteritems
import subprocess


def submit_job(job_info):
    # Construct sbatch command
    slurm_cmd = ["sbatch"]
    for key, value in iteritems(job_info):
        # Check for special case keys
        if key == "cpus_per_task":
            key = "cpus-per-task"
        elif key == "job_name":
            key = "job-name"
        elif key == "script":
            # Script path goes last, after all the options
            continue
        slurm_cmd.append("--%s=%s" % (key, value))
    slurm_cmd.append(job_info["script"])
    print("Generated slurm batch command: '%s'" % slurm_cmd)

    # Run sbatch command as subprocess.
    try:
        sbatch_output = subprocess.check_output(slurm_cmd)
    except subprocess.CalledProcessError as e:
        # Print error message from sbatch for easier debugging, then re-raise.
        # e.output holds the command's captured stdout.
        if e.output is not None:
            print("ERROR: Subprocess call output: %s" % e.output)
        raise

    # Parse job id from sbatch output ("Submitted batch job <id>").
    # decode() keeps this working on both Python 2 (str) and Python 3 (bytes).
    for s in sbatch_output.decode("utf-8").strip().split():
        if s.isdigit():
            return int(s)
```
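For example, a hypothetical call (the script name and resource values are placeholders, not from a real cluster). Note that this path also covers the "gres" question above, since sbatch accepts --gres directly:

```python
# Hypothetical example values -- substitute your own script and resources.
job_info = {"script": "my_batch_script.sh",
            "job_name": "gpu_test",
            "cpus_per_task": 8,
            "gres": "gpu:turing:1"}

job_id = submit_job(job_info)
print("Submitted batch job %s" % job_id)
```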
I am also having trouble submitting a job with anything other than 1 CPU core:

```python
mem = 32000
cpus = 4
partition = 'mypartition'
job_name = "sweet_job_name"

awesome_job_opts = {
    'script': sweet_script_name,
    'realmem': mem,
    'cpus-per-task': cpus,
    'partition': partition,
    'job_name': job_name,
}

pyslurm.job().submit_batch_job(awesome_job_opts)
```
This results in a job that has 32,000 MB but only 1 CPU core.
```
$ squeue -o "jobid: %A name: %j cpus: %C ram:%m %P %t %M %o" | grep sweet_job_name
jobid: 12345 name: sweet_job_name cpus: 1 ram:32000M mypartition R 1-02:51:10 (null)
```
AHHH! I figured it out from @cahartsell's comment above. It should be:

```python
'cpus_per_task': cpus,
```

Underscores work, dashes don't!
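For completeness, here is the corrected options dict from my earlier comment (same placeholder values as before):

```python
awesome_job_opts = {
    'script': sweet_script_name,
    'realmem': mem,
    'cpus_per_task': cpus,   # underscores, not dashes
    'partition': partition,
    'job_name': job_name,
}

pyslurm.job().submit_batch_job(awesome_job_opts)
```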