SLURM Job keeps running after Successful Job Completion (Hydra Submitit Plugin)
Hi,
I am using the Hydra submitit plugin to schedule Sweeps jobs in the Compute Canada cluster. I use the following config to schedule the sweeps:
defaults:
  - _self_
  - override hydra/launcher: submitit_slurm
tags: null
project_name: "test"
seed: 1
steps: 5000000
log_interval: 10000
trainer:
  rollout_len: 256
  num_envs: 8
  eval_interval: null
task:
  num_distractors: 6
use_wandb: True
hydra:
  mode: MULTIRUN
  launcher:
    setup:
      - export WANDB_MODE=offline
    account: test
    cpus_per_task: 8
    mem_gb: 5
    timeout_min: 300
  sweeper:
    params:
      trainer/seq_model: lstm, gru
      trainer.optimizer.learning_rate: 0.05, 0.01
      seed: 2,3,4,5
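For reference, the sweeper params above expand to a Cartesian product of the listed values, so this config submits 2 x 2 x 4 = 16 SLURM jobs, each of which can hold its allocation until the timeout. A minimal sketch of that expansion in plain Python (this is not Hydra's own sweeper code; `sweep_params` and `expand_sweep` are hypothetical names used only for illustration):

```python
from itertools import product

# Sweep values copied from the hydra.sweeper.params section of the config.
sweep_params = {
    "trainer/seq_model": ["lstm", "gru"],
    "trainer.optimizer.learning_rate": [0.05, 0.01],
    "seed": [2, 3, 4, 5],
}


def expand_sweep(params):
    """Return one override dict per combination (Cartesian product)."""
    keys = list(params)
    return [dict(zip(keys, values)) for values in product(*params.values())]


jobs = expand_sweep(sweep_params)
print(len(jobs))  # 2 * 2 * 4 = 16
print(jobs[0])    # first combination: lstm, 0.05, seed 2
```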
My jobs run successfully and finish well before the specified timeout (5 hours). However, the SLURM job keeps running even though the process has exited. I checked trainer.log, and it looks like submitit is ignoring the SIGTERM signal:
[2023-03-09 10:55:49,583][submitit][INFO] - Job completed successfully
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,586][submitit][WARNING] - Bypassing signal SIGTERM
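The warnings above are consistent with a signal handler that logs SIGTERM and returns instead of exiting. A minimal sketch of that general pattern, using only Python's standard `signal` module on a POSIX system (this is an assumption about the behavior, not submitit's actual handler; `bypass_handler` and `received` are hypothetical names):

```python
import os
import signal

received = []  # names of the signals the handler has seen


def bypass_handler(signum, frame):
    # Record the signal and return without exiting, so the process survives.
    received.append(signal.Signals(signum).name)
    print(f"Bypassing signal {received[-1]}")


# Replace the default SIGTERM action (terminate) with the logging handler.
signal.signal(signal.SIGTERM, bypass_handler)

# Send SIGTERM to this very process; with the handler installed it is not killed.
os.kill(os.getpid(), signal.SIGTERM)

print("still running after SIGTERM:", received)
```

With a handler like this installed, a plain SIGTERM never terminates the process by itself; it exits only when its own code returns or when it receives a signal that cannot be caught, such as SIGKILL.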
I'm not sure if this is a bug. Is there a way for the SLURM jobs to be killed before the timeout, once the job has completed successfully? This would free up a lot of resources for other jobs in the queue.
System Information:
Linux cedar1.cedar.computecanada.ca 3.10.0-1160.80.1.el7.x86_64 #1 SMP Tue Nov 8 15:48:59 UTC 2022 x86_64 GNU/Linux
FWIW, I am seeing this as well using the local launcher: hydra/launcher=submitit_local
Resolved in https://github.com/facebookincubator/submitit/issues/1677.