DeepSpeed
CalledProcessError: Command `'['hostname -I']'` died with `<Signals.SIGSEGV: 11>.`
I'm not sure of the cause, but when I try to run multi-node training (launched with mpich), I get the following error:
File "/lus/grand/projects/datascience/foremans/locations/polaris/projects/saforem2/Megatron-DeepSpeed/dist.py", line 106, in init_deepspeed
deepspeed.init_distributed()
File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 646, in init_distributed
mpi_discovery(distributed_port=distributed_port, verbose=verbose)
File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 674, in mpi_discovery
result = subprocess.check_output(hostname_cmd, shell=True)
File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/subprocess.py", line 415, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' died with <Signals.SIGSEGV: 11>.
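For context, `subprocess` reports a child killed by a signal as a negative return code, which is where the `died with <Signals.SIGSEGV: 11>` wording comes from. A minimal sketch for checking how a command exits on a given node, outside of DeepSpeed (the helper name `describe_exit` is made up for illustration and is not part of DeepSpeed):

```python
import signal
import subprocess


def describe_exit(cmd):
    """Run `cmd` through the shell and describe how it terminated.

    Hypothetical helper for debugging; not part of DeepSpeed.
    """
    proc = subprocess.run(
        cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    if proc.returncode < 0:
        # A negative return code means the child was killed by signal -returncode.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with code {proc.returncode}"


if __name__ == "__main__":
    # Run this on the compute node where the failure occurs.
    print(describe_exit("hostname -I"))
```

If this also reports `killed by SIGSEGV` on the compute nodes, the crash is in the `hostname` binary (or its environment) rather than in DeepSpeed itself.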
The error originates from deepspeed/comm/comm.py:
https://github.com/microsoft/DeepSpeed/blob/46784cb58edf7bbe9b6bbec95212de7b81e55b01/deepspeed/comm/comm.py#L676
An easy fix would be to replace
hostname_cmd = ["hostname -I"]
result = subprocess.check_output(hostname_cmd, shell=True)
master_addr = result.decode('utf-8').split()[0]
with
import socket
master_addr = socket.gethostbyaddr(socket.gethostname())[0]
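One caveat: `hostname -I` yields an IP address, while `socket.gethostbyaddr(...)[0]` returns a hostname, so the two are not strictly equivalent (either should work as a master address as long as the name resolves on every node). A more conservative variant, sketched below, keeps `hostname -I` as the first choice and only falls back to `socket` if the command fails. The function name `discover_master_addr` and the 10-second timeout are assumptions for illustration, not DeepSpeed's actual code:

```python
import socket
import subprocess


def discover_master_addr():
    """Return an address for this host, preferring `hostname -I`.

    Hypothetical helper; a sketch of a fallback, not DeepSpeed's implementation.
    """
    try:
        # Pass the command as an argument list so no shell is involved.
        result = subprocess.check_output(["hostname", "-I"], timeout=10)
        addrs = result.decode("utf-8").split()
        if addrs:
            return addrs[0]
    except (subprocess.SubprocessError, OSError):
        # Covers a non-zero exit, death by signal (e.g. SIGSEGV), a timeout,
        # and a missing `hostname` binary.
        pass
    try:
        # Note: this yields a hostname rather than an IP address.
        return socket.gethostbyaddr(socket.gethostname())[0]
    except OSError:
        # Reverse lookup failed; fall back to the bare hostname.
        return socket.gethostname()


if __name__ == "__main__":
    print(discover_master_addr())
```

With this approach the behavior is unchanged on systems where `hostname -I` works, and the segfault is only worked around where it occurs.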