sagemaker-training-toolkit
MPI mode sets all nodes to the same SM_CURRENT_HOST
Describe the bug
In MPI mode, all nodes report the same SM_CURRENT_HOST (which is the master's value).
To reproduce
Run a PyTorch estimator in MPI mode with more than one node. The training entrypoint can simply dump all its environment variables to stdout (which should end up in the CloudWatch logs). From there, we can see that SM_CURRENT_HOST on every node is set to the same value (i.e., the master's), whereas PMIX_HOSTNAME is set correctly.
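A minimal entrypoint to surface this is sketched below (hypothetical script name dump_env.py, not part of the toolkit); it just prints the host-related variables so each node's CloudWatch log stream can be compared:

# dump_env.py - hypothetical entrypoint: print host-related environment
# variables so the values logged by each node can be compared.
import os

for key in sorted(os.environ):
    if key.startswith("SM_") or key.startswith("PMIX_"):
        print(f"{key}={os.environ[key]}")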
Expected behavior
The master node should not propagate its SM_CURRENT_HOST to the other nodes.

System information
PyTorch DLC 1.11.0-gpu-py38
Additional context
The patch below corrected the SM_CURRENT_HOST issue in my training jobs.
# https://github.com/aws/sagemaker-training-toolkit/blob/3188a9df7803798defb043a332d789f7474219d0/src/sagemaker_training/mpi.py#L353
for name in self._env_vars:
    if name.startswith("SM_"):  # New addition: skip SM_* variables so mpirun
        continue                # does not forward the master's values to workers
    command.extend(["-x", name])
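As a workaround that does not require patching the toolkit, the training script can ignore the forwarded variable and resolve the host itself. A minimal sketch, assuming the standard SageMaker resource config path /opt/ml/input/config/resourceconfig.json and falling back to PMIX_HOSTNAME (which, as noted above, is set correctly on each node):

import json
import os

def resolve_current_host():
    # Read the per-node resource config written by SageMaker; unlike the
    # SM_CURRENT_HOST env var, it is not forwarded between nodes by mpirun -x.
    config_path = "/opt/ml/input/config/resourceconfig.json"
    if os.path.exists(config_path):
        with open(config_path) as f:
            return json.load(f)["current_host"]
    # Fall back to the MPI-provided hostname; note its format may differ
    # from the algo-N naming used by SM_CURRENT_HOST.
    return os.environ.get("PMIX_HOSTNAME", os.environ["SM_CURRENT_HOST"])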