
[BUG] deepspeed tries to call "hostname -I" which is not a valid flag for hostname. it should be "hostname -i"

Open sirus20x6 opened this issue 1 year ago • 10 comments

Describe the bug deepspeed tries to call "hostname -I", which is not a valid flag for this system's hostname binary; it should be "hostname -i"




Log output:

Processing dataset chunks: 100%|██████████| 106/106 [00:11<00:00,  9.45it/s]
[2024-09-05 04:11:37,288] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.15.2+c210e601, git-hash=c210e601, git-branch=master
[2024-09-05 04:11:37,288] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-05 04:11:37,288] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
hostname: invalid option -- 'I'
Try 'hostname --help' or 'hostname --usage' for more information.
Traceback (most recent call last):
  File "/code/git/learnable-activations/mflow.py", line 429, in <module>
    run_experiment(args)
  File "/code/git/learnable-activations/mflow.py", line 384, in run_experiment
    model_engine, optimizer = prepare_deepspeed_model(model, args)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/git/learnable-activations/mflow.py", line 266, in prepare_deepspeed_model
    model_engine, _, _, _ = deepspeed.initialize(
                            ^^^^^^^^^^^^^^^^^^^^^
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/__init__.py", line 144, in initialize
    dist.init_distributed(dist_backend=dist_backend,
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 673, in init_distributed
    mpi_discovery(distributed_port=distributed_port, verbose=verbose)
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 701, in mpi_discovery
    result = subprocess.check_output(hostname_cmd, shell=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' returned non-zero exit status 64.

System info:

  • OS: Arch
  • GPU count and types: 1x 7900 XTX
  • Interconnects (if applicable): single machine
  • Python version: 3.12

Launcher context: launching with mpirun:

#!/bin/bash
export OMPI_MCA_accelerator=rocm
mpirun -np 1 --mca accelerator rocm python mflow.py --deepspeed_config ds_config.json --log_interval 100 --batch_size 4 --local_rank -1


Additional context: the offending code, from mpi_discovery in deepspeed/comm/comm.py:

    master_addr = None
    if rank == 0:
        hostname_cmd = ["hostname -I"]
        result = subprocess.check_output(hostname_cmd, shell=True)
        master_addr = result.decode('utf-8').split()[0]
    master_addr = comm.bcast(master_addr, root=0)
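
A more defensive version of this lookup could try both flags without shell=True and fall back to the socket module when neither works. This is an illustrative sketch, not DeepSpeed's actual code; detect_master_addr is a made-up name:

```python
import socket
import subprocess

def detect_master_addr():
    """Best-effort local address discovery (illustrative sketch)."""
    for flag in ("-I", "-i"):
        try:
            # '-I' is a net-tools/Debian extension; GNU inetutils only knows '-i'.
            # Passing an argument list (no shell=True) is the conventional form.
            out = subprocess.check_output(["hostname", flag], text=True)
            fields = out.split()
            if fields:
                return fields[0]
        except (OSError, subprocess.CalledProcessError):
            continue
    # Portable fallback: resolve our own hostname via the socket module
    return socket.gethostbyname(socket.gethostname())
```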

sirus20x6 avatar Sep 05 '24 09:09 sirus20x6

Hi @sirus20x6 - this issue looks to be similar to this one: https://github.com/microsoft/DeepSpeed/issues/5597

Could you share the output of hostname --help and hostname -V?

loadams avatar Sep 05 '24 15:09 loadams

here you go!

$ hostname --help
Usage: hostname [OPTION...] [NAME]
Show or set the system's host name.

  -a, --aliases              alias names
  -d, --domain               DNS domain name
  -f, --fqdn, --long         DNS host name or FQDN
  -F, --file=FILE            set host name or NIS domain name from FILE
  -i, --ip-addresses         addresses for the host name
  -s, --short                short host name
  -y, --yp, --nis            NIS/YP domain name
  -?, --help                 give this help list
      --usage                give a short usage message
  -V, --version              print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

Report bugs to <[email protected]>.
$ hostname -V
hostname (GNU inetutils) 2.5
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Debarshi Ray.

and I believe the POSIX way of doing this is actually

getent hosts localhost

because inetutils, which is where this hostname binary comes from, is an old, largely deprecated package, even though a lot of people still have it installed because they have muscle memory around those tools

sirus20x6 avatar Sep 05 '24 16:09 sirus20x6

Small correction: if you just want the first field, the POSIX way of getting the loopback address is

getent hosts localhost | awk '{ print $1 }'
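
That lookup could be wrapped in Python like this (a sketch; loopback_addr is a made-up name, and getent assumes a glibc-style system):

```python
import socket
import subprocess

def loopback_addr():
    """First field of 'getent hosts localhost', with a socket fallback."""
    try:
        out = subprocess.check_output(
            ["getent", "hosts", "localhost"], text=True)
        fields = out.split()
        if fields:
            return fields[0]  # the address column
    except (OSError, subprocess.CalledProcessError):
        pass
    # Fallback when getent is unavailable (e.g. non-glibc systems)
    return socket.gethostbyname("localhost")
```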

sirus20x6 avatar Sep 05 '24 16:09 sirus20x6

Thanks, @sirus20x6 - we are also looking at switching to just using socket.gethostname() and socket.gethostbyname_ex() to work around this entirely. Do you think that would work for your needs?

loadams avatar Sep 05 '24 16:09 loadams

I believe so. Hopefully that will be more cross-platform and resilient.
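
A minimal sketch of that socket-based approach, under the assumption that a non-loopback address is preferred when one resolves (socket_master_addr is an illustrative name, not the actual patch):

```python
import socket

def socket_master_addr():
    """Resolve this host's IPv4 address without shelling out to hostname."""
    host = socket.gethostname()
    try:
        # gethostbyname_ex returns (canonical_name, aliases, ip_addresses)
        _, _, addrs = socket.gethostbyname_ex(host)
    except socket.gaierror:
        return "127.0.0.1"  # unresolvable hostname: fall back to loopback
    # Prefer a non-loopback address when one exists
    for addr in addrs:
        if not addr.startswith("127."):
            return addr
    return addrs[0] if addrs else "127.0.0.1"
```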

sirus20x6 avatar Sep 05 '24 16:09 sirus20x6

If you want, you could test with pip install git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I

loadams avatar Sep 05 '24 16:09 loadams

I will test as soon as I get home to my machine!

sirus20x6 avatar Sep 05 '24 18:09 sirus20x6

It doesn't install:

> pip uninstall deepspeed
Found existing installation: deepspeed 0.15.2+c210e601
Uninstalling deepspeed-0.15.2+c210e601:
  Would remove:
    /thearray/git/ComfyUI/comfyvenv/bin/deepspeed
    /thearray/git/ComfyUI/comfyvenv/bin/deepspeed.pt
    /thearray/git/ComfyUI/comfyvenv/bin/ds
    /thearray/git/ComfyUI/comfyvenv/bin/ds_bench
    /thearray/git/ComfyUI/comfyvenv/bin/ds_elastic
    /thearray/git/ComfyUI/comfyvenv/bin/ds_report
    /thearray/git/ComfyUI/comfyvenv/bin/ds_ssh
    /thearray/git/ComfyUI/comfyvenv/bin/dsr
    /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed-0.15.2+c210e601.dist-info/*
    /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/*
Proceed (Y/n)? y
  Successfully uninstalled deepspeed-0.15.2+c210e601
(comfyvenv) (base) neuromancer :) > pip install git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I
Collecting git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I
  Cloning https://github.com/microsoft/deepspeed.git (to revision loadams/update-hostname-I) to /tmp/pip-req-build-lvq7vagu
  Running command git clone --filter=blob:none --quiet https://github.com/microsoft/deepspeed.git /tmp/pip-req-build-lvq7vagu
  Running command git checkout -b loadams/update-hostname-I --track origin/loadams/update-hostname-I
  Switched to a new branch 'loadams/update-hostname-I'
  branch 'loadams/update-hostname-I' set up to track 'origin/loadams/update-hostname-I'.
  Resolved https://github.com/microsoft/deepspeed.git to commit 0d2aada49e58490a5a38867b0475f4b57e12c2ae
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [63 lines of output]
      [2024-09-05 23:23:45,883] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      [2024-09-05 23:23:46,521] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
        _torch_pytree._register_pytree_node(
      /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
        _torch_pytree._register_pytree_node(
      DS_BUILD_OPS=0
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/py_ds_aio.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/py_ds_aio.cpp [skipped, no changes]
      Successfully preprocessed all matching files.
      Total number of unsupported CUDA function calls: 0
      
      
      Total number of replaced kernel launches: 0
      /tmp/pip-req-build-lvq7vagu/csrc/adam/fused_adam_frontend.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/adam/fused_adam_frontend.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/includes/compat.h -> /tmp/pip-req-build-lvq7vagu/csrc/includes/compat.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_apply.cuh -> /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_apply_hip.cuh [ok]
      /tmp/pip-req-build-lvq7vagu/csrc/includes/type_shim.h -> /tmp/pip-req-build-lvq7vagu/csrc/includes/type_shim_hip.h [ok]
      /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_adam.cu -> /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_adam.hip [ok]
      Successfully preprocessed all matching files.
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-req-build-lvq7vagu/setup.py", line 198, in <module>
          ext_modules.append(builder.builder())
                             ^^^^^^^^^^^^^^^^^
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 699, in builder
          {'cxx': self.strip_empty_entries(self.cxx_args()), \
                                           ^^^^^^^^^^^^^^^
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 842, in cxx_args
          CUDA_ENABLE = self.is_cuda_enable()
                        ^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 420, in is_cuda_enable
          assert_no_cuda_mismatch(self.name)
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 86, in assert_no_cuda_mismatch
          torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
                                        ^^^^^^^^^^^^^^^^^^^^^^^^
      AttributeError: 'NoneType' object has no attribute 'split'
      Total number of unsupported CUDA function calls: 0
      
      
      Total number of replaced kernel launches: 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
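
For what it's worth, the AttributeError above comes from torch.version.cuda being None on ROCm/CPU builds of torch, while builder.py calls .split() on it unconditionally. A hypothetical guard (not the actual fix) would look like:

```python
def parse_cuda_version(version):
    """Return 'major.minor' from a torch.version.cuda string, tolerating
    None, which is what ROCm/CPU builds of torch report."""
    if version is None:
        return None
    return ".".join(version.split(".")[:2])
```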

sirus20x6 avatar Sep 06 '24 04:09 sirus20x6

+1 for the socket approach

Replacing the subprocess call here, in deepspeed/comm/comm.py#L700-L702, with

import socket
master_addr = socket.gethostbyaddr(socket.gethostname())[0]

has been working for me on internal systems

+    import socket
-    master_addr = None
     if rank == 0:
-        hostname_cmd = ["hostname -I"]
-        result = subprocess.check_output(hostname_cmd, shell=True)
-        master_addr = result.decode('utf-8').split()[0]
+        master_addr = socket.gethostbyaddr(socket.gethostname())[0]

also see: #2837

I'd be happy to submit a PR + test further if it would be useful

saforem2 avatar Sep 10 '24 20:09 saforem2

@saforem2, thanks for offering to help with this. Please see our concerns here. Would appreciate your insights and PR.

tjruwase avatar Sep 10 '24 21:09 tjruwase

This should be resolved with this PR: #6990. Let me know if you are still having issues with this.

loadams avatar Feb 14 '25 16:02 loadams