webknossos-libs

Errors during slurm job submission are not propagated

Open daniel-wer opened this issue 10 months ago • 0 comments

Context

  • Affected library: cluster-tools

  • If there is an error during slurm job submission, for example if sbatch complains that the job submission script is invalid, the resulting error is not propagated to the caller, leading to a hanging program.

Exception in thread Thread-323:
Traceback (most recent call last):
  File ".local/share/uv/python/cpython-3.11.10-linux-x86_64-gnu/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File ".venv/lib/python3.11/site-packages/cluster_tools/schedulers/slurm.py", line 577, in run
    job_id = SlurmExecutor.submit_text(script, self.cfut_dir)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/cluster_tools/schedulers/slurm.py", line 248, in submit_text
    job_id, stderr = chcall("sbatch --parsable {}".format(filename))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/cluster_tools/_utils/call.py", line 47, in chcall
    raise CommandError(command, code, stderr)
cluster_tools._utils.call.CommandError: 'sbatch --parsable <redacted>.sh' exited with status 1: 'sbatch: error: memory limit must be provided for shared jobs\nsbatch: error: Batch job submission failed: Invalid feature specification\n'
^C
  • This bug was introduced with the use of job submission threads. Since the submission threads are never joined and there is no special error handling/communication, errors are not propagated.
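
The mechanism can be illustrated without slurm. The following is a minimal sketch using only the standard library, not the actual cluster-tools code: `FakeSubmissionError`, `submit_job`, and `wait_for_result` are hypothetical names, and the timeout exists only so the example terminates instead of hanging.

```python
import concurrent.futures
import threading


class FakeSubmissionError(Exception):
    """Hypothetical stand-in for cluster_tools' CommandError."""


def submit_job(future):
    # Stand-in for the sbatch call: it fails before the future is resolved.
    raise FakeSubmissionError("sbatch: error: Batch job submission failed")


def wait_for_result(timeout):
    future = concurrent.futures.Future()
    # The thread is started but never joined, and nothing catches its exception.
    thread = threading.Thread(target=submit_job, args=(future,), daemon=True)
    thread.start()
    try:
        # Without the timeout, this call would block forever.
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return "timed out: the submission error never reached the caller"


print(wait_for_result(timeout=0.5))
```

The traceback of the failing thread is printed to stderr, as in the log above, but the waiting caller only ever sees an unresolved future.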

Expected Behavior

  • The caller of the slurm executor should be notified about the submission error through a raised error
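
One possible way to get this behavior is to catch the error inside the submission thread and attach it to the future the caller waits on. This is a sketch under assumptions, not a proposed patch: `FakeSubmissionError`, `submit_job`, and `submit_in_thread` are hypothetical names and do not exist in cluster-tools.

```python
import concurrent.futures
import threading


class FakeSubmissionError(Exception):
    """Hypothetical stand-in for cluster_tools' CommandError."""


def submit_job():
    # Stand-in for the failing sbatch call.
    raise FakeSubmissionError("sbatch: error: Batch job submission failed")


def submit_in_thread(submit_fn):
    future = concurrent.futures.Future()

    def run():
        try:
            future.set_result(submit_fn())
        except Exception as exc:
            # Hand the error to whoever waits on the future.
            future.set_exception(exc)

    thread = threading.Thread(target=run)
    thread.start()
    return future, thread


future, thread = submit_in_thread(submit_job)
thread.join()
try:
    future.result()
except FakeSubmissionError as exc:
    print(f"caller received: {exc}")
```

With this pattern, `future.result()` re-raises the submission error in the calling thread instead of blocking.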

Current Behavior

  • No error is raised on the caller side and no further jobs are submitted, so the program hangs indefinitely

Steps to Reproduce the bug

  • [ ] Cannot reproduce the bug anymore / needs deeper investigation.
  1. Provoke an sbatch submission error, for example by specifying the slurm strategy and a time or mem resource that is too large or invalid
  2. The caller does not shut down and hangs indefinitely

Your Environment for bug

  • Operating System and version: Linux 5.14.21
  • Version of webKnossos-libs (Release or Commit): 0.16.2

daniel-wer, Jan 21 '25 12:01