mmengine
[Fix] SLURM distributed training in containers where `scontrol` is not available
Motivation
Distributed training with SLURM does not work if you run it in a containerized environment such as Docker, enroot or Apptainer/Singularity on an HPC system.
This PR checks whether `scontrol` is available and, if it is not, retrieves the master address without it. With this change, training with SLURM is now possible within a containerized environment on the HPC system of Research Center Jülich in Germany (https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html). This should generalize to all systems where containers do not have `scontrol` available.
Additionally, this PR fixes these issues:
https://github.com/open-mmlab/mmcv/pull/1970 https://github.com/open-mmlab/mmcv/issues/700
Modification
Two functions are added to `mmengine/dist/utils.py`:
- `_slurm_extract_first_node(slurm_nodelist)`: replaces `scontrol` to set the `addr` of the master node if needed.
- `_is_scontrol_available()`: checks whether that replacement is needed.
BC-breaking (Optional)
No breaking changes are introduced.
Use cases (Optional)
HPC Training within containerized environments.
Checklist
- Pre-commit or other linting tools are used to fix the potential lint issues.
- The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDetection or MMPretrain.
- The documentation has been modified accordingly, like docstring or example tutorials.
:heavy_check_mark: