mmengine [Fix] SLURM distributed training in containers where `scontrol` is not available

Open R-Fehler opened this issue 10 months ago • 1 comments

Motivation

Distributed training with SLURM does not work when it runs in a containerized environment such as Docker, Enroot, or Apptainer/Singularity on an HPC system, because `scontrol` is typically not available inside the container.

By checking whether `scontrol` is available and, if it is not, retrieving the master address without it, training with SLURM now works within a containerized environment on the HPC system of the Jülich research center in Germany (https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html). This should generalize to all systems where `scontrol` is not available inside containers.

Additionally, this PR fixes these issues:

https://github.com/open-mmlab/mmcv/pull/1970
https://github.com/open-mmlab/mmcv/issues/700

Modification

Two functions are added to mmengine/dist/utils.py:

_is_scontrol_available(): checks whether `scontrol` can be used in the current environment.
_slurm_extract_first_node(slurm_nodelist): replaces `scontrol` for determining the address of the master node when `scontrol` is not available.

BC-breaking (Optional)

No breaking changes are introduced.

Use cases (Optional)

HPC training within containerized environments.

Checklist

Pre-commit or other linting tools are used to fix the potential lint issues.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDetection or MMPretrain.
The documentation has been modified accordingly, like docstring or example tutorials.

:heavy_check_mark:

Apr 03 '24 14:04 R-Fehler

All committers have signed the CLA.

Apr 03 '24 14:04 CLAassistant
