
SLURM Job keeps running after Successful Job Completion (Hydra Submitit Plugin)

Open subho406 opened this issue 2 years ago • 2 comments

Hi,

I am using the Hydra submitit plugin to schedule sweep jobs on the Compute Canada cluster, with the following config:

defaults:
  - _self_
  - override hydra/launcher: submitit_slurm


tags: null
project_name: "test"
seed: 1
steps: 5000000
log_interval: 10000
trainer:
  rollout_len: 256
  num_envs: 8
eval_interval: null
task:
  num_distractors: 6
use_wandb: True

hydra:
  mode: MULTIRUN
  launcher:
    setup:
      - export WANDB_MODE=offline
    account: test
    cpus_per_task: 8
    mem_gb: 5
    timeout_min: 300

  sweeper:
    params:
      trainer/seq_model: lstm, gru
      trainer.optimizer.learning_rate: 0.05, 0.01
      seed: 2,3,4,5
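
To make the scale of the sweep concrete, here is a rough sketch of how the three sweeper parameters above expand into jobs, assuming Hydra's basic sweeper takes the Cartesian product of the listed values. The variable names are made up for illustration, and the core-hours figure is only an upper bound that assumes every job holds its 8 CPUs until the 300-minute timeout.

```python
from itertools import product

# Sweep values copied from the hydra.sweeper.params section of the config.
sweep = {
    "trainer/seq_model": ["lstm", "gru"],
    "trainer.optimizer.learning_rate": [0.05, 0.01],
    "seed": [2, 3, 4, 5],
}

# One job per combination of values (Cartesian product).
jobs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]

# Launcher settings copied from the config.
cpus_per_task = 8
timeout_min = 300

# Upper bound on reserved core-hours if no job released its allocation early.
core_hours = len(jobs) * cpus_per_task * timeout_min / 60

print(len(jobs), core_hours)
```

With the values in this config, that is 16 jobs and up to 640 core-hours reserved, which is why releasing the allocation as soon as the work finishes matters.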

My jobs run successfully and finish before the specified timeout (300 minutes, i.e. 5 hours). However, the SLURM job seems to keep running even though the process has exited. I checked trainer.log, and it looks like submitit is ignoring the SIGTERM signal:

[2023-03-09 10:55:49,583][submitit][INFO] - Job completed successfully
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,586][submitit][WARNING] - Bypassing signal SIGTERM
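
For context on what a warning like the one above can mean in general: a process may install a handler that records a signal and carries on instead of exiting. The sketch below is a generic, Unix-only illustration using Python's standard `signal` module. It is not submitit's actual code, and the names `bypass` and `received` are invented for this example.

```python
import os
import signal
import time

received = []  # names of the signals seen by the handler


def bypass(signum, frame):
    # Record the signal and return; the process is not terminated.
    received.append(signal.Signals(signum).name)


# Replace the default SIGTERM behaviour (terminate) with the handler above.
signal.signal(signal.SIGTERM, bypass)

# Send SIGTERM to this very process a few times.
for _ in range(3):
    os.kill(os.getpid(), signal.SIGTERM)
    time.sleep(0.01)  # give the interpreter a chance to run the handler

# This line is only reached because SIGTERM did not terminate the process.
still_alive = True
print(received, still_alive)
```

A process that handles SIGTERM this way stays alive until it exits on its own or receives a signal it cannot catch, such as SIGKILL.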

I'm not sure if this is a bug. Is there a way for the SLURM jobs to be killed once the job completes successfully, rather than waiting for the timeout? This would free up a lot of resources for other jobs in the queue.

System Information:

Linux cedar1.cedar.computecanada.ca 3.10.0-1160.80.1.el7.x86_64 #1 SMP Tue Nov 8 15:48:59 UTC 2022 x86_64 GNU/Linux

subho406, Mar 09 '23 22:03

FWIW, I am seeing this as well using the local launcher: hydra/launcher=submitit_local

terrykong, Mar 14 '23 22:03

Resolved in https://github.com/facebookincubator/submitit/issues/1677.

nikhilxb, Nov 03 '23 17:11