sagemaker-training-toolkit: Mpi mode sets all nodes to the same SM_CURRENT_HOST

Open verdimrc opened this issue 2 years ago • 0 comments

Describe the bug In mpi mode, all nodes report the same SM_CURRENT_HOST, namely the master's value.

To reproduce Run a PyTorch estimator in mpi mode with more than one node. The training entrypoint can simply dump all of its environment variables to stdout (which should end up in the CloudWatch logs). From there, we can see that SM_CURRENT_HOST is set to the same value on all nodes (i.e., the master's), whereas PMIX_HOSTNAME is set correctly.
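A minimal sketch of such an entrypoint, for illustration only. The helper names `dump_environment` and `collect_host_vars` are hypothetical and not part of the toolkit; the script just prints what each process sees so the values can be compared per node in the logs.

```python
import os


def collect_host_vars(environ=os.environ):
    """Return the two host-related variables compared in this report."""
    names = ("SM_CURRENT_HOST", "PMIX_HOSTNAME")
    return {name: environ.get(name, "<unset>") for name in names}


def dump_environment(environ=os.environ):
    """Print every environment variable, sorted by name, one per line."""
    for name in sorted(environ):
        print(f"{name}={environ[name]}", flush=True)


if __name__ == "__main__":
    dump_environment()
    # A one-line summary is easier to spot in the log stream than the full dump.
    print("host summary:", collect_host_vars(), flush=True)
```

If the bug is present, the summary line from every node should show the same SM_CURRENT_HOST while PMIX_HOSTNAME differs.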

Expected behavior The master node should not propagate its SM_CURRENT_HOST to the other nodes; each node should report its own value.

System information PyTorch DLC 1.11.0-gpu-py38

Additional context This patch corrected the SM_CURRENT_HOST issue on my training jobs.

# https://github.com/aws/sagemaker-training-toolkit/blob/3188a9df7803798defb043a332d789f7474219d0/src/sagemaker_training/mpi.py#L353
        for name in self._env_vars:
            if name.startswith("SM_"):    # New addition
                continue                  # New addition
            command.extend(["-x", name])
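The filtering idea in the patch can be tried in isolation with a small standalone sketch. `build_export_flags` and the sample variable names are hypothetical, used here only for illustration; the real logic lives inside the toolkit's mpi.py and may differ in its surroundings.

```python
def build_export_flags(env_var_names, skip_prefix="SM_"):
    """Build mpirun-style `-x NAME` flags, skipping names that start with skip_prefix."""
    flags = []
    for name in env_var_names:
        if name.startswith(skip_prefix):
            # Not exported by the launcher, so the master's value is not forwarded.
            continue
        flags.extend(["-x", name])
    return flags


if __name__ == "__main__":
    # Sample names are made up for the demonstration.
    names = ["SM_CURRENT_HOST", "NCCL_DEBUG", "SM_HOSTS", "PATH"]
    print(build_export_flags(names))  # ['-x', 'NCCL_DEBUG', '-x', 'PATH']
```

Note that this skips every variable with the SM_ prefix, not only SM_CURRENT_HOST, which mirrors the patch as posted.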

Oct 31 '22 04:10 verdimrc
