[Fix] SLURM distributed training in containers where `scontrol` is not available

Open R-Fehler opened this issue 10 months ago • 1 comments

Motivation

Distributed training with SLURM does not work when it runs inside a containerized environment such as Docker, Enroot, or Apptainer/Singularity on an HPC system.

By checking whether `scontrol` is available and, if it is not, retrieving the master address without it, training with SLURM now works within a containerized environment on the HPC system of the Research Center Jülich in Germany (https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html). This should generalize to all systems where `scontrol` is not available inside containers.

Additionally, this PR fixes these issues:

https://github.com/open-mmlab/mmcv/pull/1970
https://github.com/open-mmlab/mmcv/issues/700

Modification

Two functions are added to `mmengine/dist/utils.py`:

`_slurm_extract_first_node(slurm_nodelist)`: replaces `scontrol` for setting the address of the master node when needed. Whether it is needed is checked via `_is_scontrol_available()`.
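For illustration, the idea can be sketched roughly as follows. This is not the code from the PR: the function names `is_scontrol_available`, `extract_first_node`, and `resolve_master_addr` are hypothetical. The parsing assumes the common compressed node-list form (a prefix followed by at most one bracket group, e.g. `jwb[0001-0004,0010]`) and ignores any suffix after the bracket.

```python
import re
import shutil
import subprocess


def is_scontrol_available() -> bool:
    """Return True if an `scontrol` executable is found on PATH."""
    return shutil.which("scontrol") is not None


def extract_first_node(nodelist: str) -> str:
    """Return the first hostname of a compressed SLURM node list.

    Assumption: each entry is a prefix optionally followed by one
    bracket group, such as "node[01-03,07]" or "gpu07".
    """
    # Prefix = everything up to the first comma or opening bracket;
    # the bracket group, if present, holds indices and index ranges.
    match = re.match(r"([^,\[]+)(?:\[([^\]]+)\])?", nodelist.strip())
    if match is None:
        raise ValueError(f"Cannot parse node list: {nodelist!r}")
    prefix, group = match.groups()
    if group is None:
        return prefix  # plain hostname, nothing to expand
    # The first index is whatever precedes the first "," or "-".
    first_index = re.split(r"[,-]", group, maxsplit=1)[0]
    return prefix + first_index


def resolve_master_addr(nodelist: str) -> str:
    """Use scontrol when present, otherwise fall back to parsing."""
    if is_scontrol_available():
        result = subprocess.run(
            ["scontrol", "show", "hostnames", nodelist],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.splitlines()[0]
    return extract_first_node(nodelist)
```

Inside a container without `scontrol`, `resolve_master_addr(os.environ["SLURM_NODELIST"])` would take the parsing branch. Node-list shapes outside the stated assumption (multiple bracket groups, suffixes after the bracket) would need a fuller parser.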

BC-breaking (Optional)

No breaking changes are introduced.

Use cases (Optional)

HPC training within containerized environments.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDetection or MMPretrain.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

:heavy_check_mark:

R-Fehler, Apr 03 '24 14:04

CLA assistant check
All committers have signed the CLA.

CLAassistant, Apr 03 '24 14:04