
[BUG] deepspeed tries to call "hostname -I" which is not a valid flag for hostname. it should be "hostname -i"

Open sirus20x6 opened this issue 1 year ago • 10 comments

Describe the bug deepspeed tries to call "hostname -I", which is not a valid flag for this system's hostname binary; it should be "hostname -i"




Log output:

Processing dataset chunks: 100%|██████████| 106/106 [00:11<00:00,  9.45it/s]
[2024-09-05 04:11:37,288] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.15.2+c210e601, git-hash=c210e601, git-branch=master
[2024-09-05 04:11:37,288] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-05 04:11:37,288] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
hostname: invalid option -- 'I'
Try 'hostname --help' or 'hostname --usage' for more information.
Traceback (most recent call last):
  File "/code/git/learnable-activations/mflow.py", line 429, in <module>
    run_experiment(args)
  File "/code/git/learnable-activations/mflow.py", line 384, in run_experiment
    model_engine, optimizer = prepare_deepspeed_model(model, args)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/git/learnable-activations/mflow.py", line 266, in prepare_deepspeed_model
    model_engine, _, _, _ = deepspeed.initialize(
                            ^^^^^^^^^^^^^^^^^^^^^
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/__init__.py", line 144, in initialize
    dist.init_distributed(dist_backend=dist_backend,
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 673, in init_distributed
    mpi_discovery(distributed_port=distributed_port, verbose=verbose)
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 701, in mpi_discovery
    result = subprocess.check_output(hostname_cmd, shell=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' returned non-zero exit status 64.

System info:

  • OS: Arch
  • GPU count and types: 1x 7900 XTX
  • Interconnects (if applicable): single machine
  • Python version: 3.12

Launcher context: launching with mpirun:

#!/bin/bash
export OMPI_MCA_accelerator=rocm
mpirun -np 1 --mca accelerator rocm python mflow.py --deepspeed_config ds_config.json --log_interval 100 --batch_size 4 --local_rank -1


Additional context: the offending code, from mpi_discovery in deepspeed/comm/comm.py:

    master_addr = None
    if rank == 0:
        hostname_cmd = ["hostname -I"]
        result = subprocess.check_output(hostname_cmd, shell=True)
        master_addr = result.decode('utf-8').split()[0]
    master_addr = comm.bcast(master_addr, root=0)
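
A more defensive version of this lookup could try both flags without shell=True and fall back to the socket module when neither works. This is an illustrative sketch, not DeepSpeed's actual code; detect_master_addr is a made-up name:

```python
import socket
import subprocess

def detect_master_addr():
    """Best-effort local address discovery (illustrative sketch)."""
    for flag in ("-I", "-i"):
        try:
            # '-I' is a net-tools/Debian extension; GNU inetutils only knows '-i'.
            # Passing an argument list (no shell=True) is the conventional form.
            out = subprocess.check_output(["hostname", flag], text=True)
            fields = out.split()
            if fields:
                return fields[0]
        except (OSError, subprocess.CalledProcessError):
            continue
    # Portable fallback: resolve our own hostname via the socket module
    return socket.gethostbyname(socket.gethostname())
```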

sirus20x6 avatar Sep 05 '24 09:09 sirus20x6

Hi @sirus20x6 - this issue looks to be similar to this one: https://github.com/microsoft/DeepSpeed/issues/5597

Could you share the output of hostname --help and hostname -V?

loadams avatar Sep 05 '24 15:09 loadams

here you go!

$ hostname --help
Usage: hostname [OPTION...] [NAME]
Show or set the system's host name.

  -a, --aliases              alias names
  -d, --domain               DNS domain name
  -f, --fqdn, --long         DNS host name or FQDN
  -F, --file=FILE            set host name or NIS domain name from FILE
  -i, --ip-addresses         addresses for the host name
  -s, --short                short host name
  -y, --yp, --nis            NIS/YP domain name
  -?, --help                 give this help list
      --usage                give a short usage message
  -V, --version              print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

Report bugs to <[email protected]>.
$ hostname -V
hostname (GNU inetutils) 2.5
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Debarshi Ray.

and I believe the POSIX way of doing this is actually

getent hosts localhost

because inetutils, which is where this hostname binary comes from, is an old, largely deprecated package, even though a lot of people still have it installed because they have muscle memory around those tools

sirus20x6 avatar Sep 05 '24 16:09 sirus20x6

Small correction: if you just want the first field, the POSIX way of getting the loopback address is

getent hosts localhost | awk '{ print $1 }'
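
That lookup could be wrapped in Python like this (a sketch; loopback_addr is a made-up name, and getent assumes a glibc-style system):

```python
import socket
import subprocess

def loopback_addr():
    """First field of 'getent hosts localhost', with a socket fallback."""
    try:
        out = subprocess.check_output(
            ["getent", "hosts", "localhost"], text=True)
        fields = out.split()
        if fields:
            return fields[0]  # the address column
    except (OSError, subprocess.CalledProcessError):
        pass
    # Fallback when getent is unavailable (e.g. non-glibc systems)
    return socket.gethostbyname("localhost")
```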

sirus20x6 avatar Sep 05 '24 16:09 sirus20x6

Thanks, @sirus20x6 - we are also looking at switching to just using socket.gethostname() and socket.gethostbyname_ex() to work around this entirely. Do you think that would work for your needs?

loadams avatar Sep 05 '24 16:09 loadams

I believe so. Hopefully that will be more cross-platform and resilient.
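
A minimal sketch of that socket-based approach, under the assumption that a non-loopback address is preferred when one resolves (socket_master_addr is an illustrative name, not the actual patch):

```python
import socket

def socket_master_addr():
    """Resolve this host's IPv4 address without shelling out to hostname."""
    host = socket.gethostname()
    try:
        # gethostbyname_ex returns (canonical_name, aliases, ip_addresses)
        _, _, addrs = socket.gethostbyname_ex(host)
    except socket.gaierror:
        return "127.0.0.1"  # unresolvable hostname: fall back to loopback
    # Prefer a non-loopback address when one exists
    for addr in addrs:
        if not addr.startswith("127."):
            return addr
    return addrs[0] if addrs else "127.0.0.1"
```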

sirus20x6 avatar Sep 05 '24 16:09 sirus20x6

If you want, you could test with pip install git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I

loadams avatar Sep 05 '24 16:09 loadams

I will test as soon as I get home to my machine!

sirus20x6 avatar Sep 05 '24 18:09 sirus20x6

It doesn't install:

> pip uninstall deepspeed
Found existing installation: deepspeed 0.15.2+c210e601
Uninstalling deepspeed-0.15.2+c210e601:
  Would remove:
    /thearray/git/ComfyUI/comfyvenv/bin/deepspeed
    /thearray/git/ComfyUI/comfyvenv/bin/deepspeed.pt
    /thearray/git/ComfyUI/comfyvenv/bin/ds
    /thearray/git/ComfyUI/comfyvenv/bin/ds_bench
    /thearray/git/ComfyUI/comfyvenv/bin/ds_elastic
    /thearray/git/ComfyUI/comfyvenv/bin/ds_report
    /thearray/git/ComfyUI/comfyvenv/bin/ds_ssh
    /thearray/git/ComfyUI/comfyvenv/bin/dsr
    /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed-0.15.2+c210e601.dist-info/*
    /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/*
Proceed (Y/n)? y
  Successfully uninstalled deepspeed-0.15.2+c210e601
(comfyvenv) (base) neuromancer :) > pip install git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I
Collecting git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I
  Cloning https://github.com/microsoft/deepspeed.git (to revision loadams/update-hostname-I) to /tmp/pip-req-build-lvq7vagu
  Running command git clone --filter=blob:none --quiet https://github.com/microsoft/deepspeed.git /tmp/pip-req-build-lvq7vagu
  Running command git checkout -b loadams/update-hostname-I --track origin/loadams/update-hostname-I
  Switched to a new branch 'loadams/update-hostname-I'
  branch 'loadams/update-hostname-I' set up to track 'origin/loadams/update-hostname-I'.
  Resolved https://github.com/microsoft/deepspeed.git to commit 0d2aada49e58490a5a38867b0475f4b57e12c2ae
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [63 lines of output]
      [2024-09-05 23:23:45,883] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      [2024-09-05 23:23:46,521] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
        _torch_pytree._register_pytree_node(
      /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
        _torch_pytree._register_pytree_node(
      DS_BUILD_OPS=0
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/py_ds_aio.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/py_ds_aio.cpp [skipped, no changes]
      Successfully preprocessed all matching files.
      Total number of unsupported CUDA function calls: 0
      
      
      Total number of replaced kernel launches: 0
      /tmp/pip-req-build-lvq7vagu/csrc/adam/fused_adam_frontend.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/adam/fused_adam_frontend.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/includes/compat.h -> /tmp/pip-req-build-lvq7vagu/csrc/includes/compat.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_apply.cuh -> /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_apply_hip.cuh [ok]
      /tmp/pip-req-build-lvq7vagu/csrc/includes/type_shim.h -> /tmp/pip-req-build-lvq7vagu/csrc/includes/type_shim_hip.h [ok]
      /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_adam.cu -> /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_adam.hip [ok]
      Successfully preprocessed all matching files.
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-req-build-lvq7vagu/setup.py", line 198, in <module>
          ext_modules.append(builder.builder())
                             ^^^^^^^^^^^^^^^^^
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 699, in builder
          {'cxx': self.strip_empty_entries(self.cxx_args()), \
                                           ^^^^^^^^^^^^^^^
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 842, in cxx_args
          CUDA_ENABLE = self.is_cuda_enable()
                        ^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 420, in is_cuda_enable
          assert_no_cuda_mismatch(self.name)
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 86, in assert_no_cuda_mismatch
          torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
                                        ^^^^^^^^^^^^^^^^^^^^^^^^
      AttributeError: 'NoneType' object has no attribute 'split'
      Total number of unsupported CUDA function calls: 0
      
      
      Total number of replaced kernel launches: 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
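
For what it's worth, the AttributeError above comes from torch.version.cuda being None on ROCm/CPU builds of torch, while builder.py calls .split() on it unconditionally. A hypothetical guard (not the actual fix) would look like:

```python
def parse_cuda_version(version):
    """Return 'major.minor' from a torch.version.cuda string, tolerating
    None, which is what ROCm/CPU builds of torch report."""
    if version is None:
        return None
    return ".".join(version.split(".")[:2])
```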

sirus20x6 avatar Sep 06 '24 04:09 sirus20x6

+1 for the socket approach

Replacing the subprocess call here, in deepspeed/comm/comm.py#L700-L702, with

import socket
master_addr = socket.gethostbyaddr(socket.gethostname())[0]

has been working for me on internal systems

+    import socket
-    master_addr = None
     if rank == 0:
-        hostname_cmd = ["hostname -I"]
-        result = subprocess.check_output(hostname_cmd, shell=True)
-        master_addr = result.decode('utf-8').split()[0]
+        master_addr = socket.gethostbyaddr(socket.gethostname())[0]

also see: #2837

I'd be happy to submit a PR + test further if it would be useful

saforem2 avatar Sep 10 '24 20:09 saforem2

@saforem2, thanks for offering to help with this. Please see our concerns here. Would appreciate your insights and PR.

tjruwase avatar Sep 10 '24 21:09 tjruwase

This should be resolved with this PR: #6990. Let me know if you are still having issues with this.

loadams avatar Feb 14 '25 16:02 loadams