CalledProcessError: Command `'['hostname -I']'` died with `<Signals.SIGSEGV: 11>.`

Open saforem2 opened this issue 2 years ago • 0 comments

I'm not sure of the cause, but when trying to run multi-node training (launched with mpich), I get the following error:

  File "/lus/grand/projects/datascience/foremans/locations/polaris/projects/saforem2/Megatron-DeepSpeed/dist.py", line 106, in init_deepspeed
    deepspeed.init_distributed()
  File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 646, in init_distributed
    mpi_discovery(distributed_port=distributed_port, verbose=verbose)
  File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 674, in mpi_discovery
    result = subprocess.check_output(hostname_cmd, shell=True)
  File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' died with <Signals.SIGSEGV: 11>.

The error originates from the `mpi_discovery` function in deepspeed/comm/comm.py:

https://github.com/microsoft/DeepSpeed/blob/46784cb58edf7bbe9b6bbec95212de7b81e55b01/deepspeed/comm/comm.py#L676

An easy fix would be to replace

hostname_cmd = ["hostname -I"]
result = subprocess.check_output(hostname_cmd, shell=True)
master_addr = result.decode('utf-8').split()[0]

with

import socket
master_addr = socket.gethostbyaddr(socket.gethostname())[0]
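A more defensive variant, as a minimal sketch only: it keeps the `hostname -I` lookup but falls back to the `socket` call when the subprocess fails (for example, when it dies with a signal, as in the traceback above). The function name `discover_master_addr` and its `cmd` and `fallback` parameters are hypothetical and not part of DeepSpeed. Note also that `socket.gethostbyaddr(...)[0]` returns a host name rather than an IP address, so this assumes the launcher can resolve that name on every node.

```python
import socket
import subprocess


def discover_master_addr(cmd=("hostname", "-I"), fallback=None):
    """Return an address for this host (hypothetical helper, not DeepSpeed API).

    Tries `cmd` first and returns the first whitespace-separated token of its
    output. If the command fails, is missing, or prints nothing, the result of
    `fallback()` is returned instead.
    """
    try:
        # Pass the command as an argument list so no shell is involved.
        output = subprocess.check_output(list(cmd))
        addrs = output.decode("utf-8").split()
        if addrs:
            return addrs[0]
    except (subprocess.CalledProcessError, OSError):
        # CalledProcessError covers non-zero exits and deaths by signal
        # (such as SIGSEGV); OSError covers a missing executable.
        pass

    if fallback is None:
        # Same lookup as the replacement proposed above; returns a host name.
        def fallback():
            return socket.gethostbyaddr(socket.gethostname())[0]

    return fallback()
```

With this shape, `master_addr = discover_master_addr()` would behave like the original code on machines where `hostname -I` works, and would only use the `socket` lookup where it does not.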

saforem2, Feb 15 '23 22:02