[BUG] DeepSpeed calls "hostname -I", which is not a valid flag for the GNU inetutils hostname; it should be "hostname -i"
Describe the bug
DeepSpeed tries to call "hostname -I", which is not a valid flag for the hostname binary on this system (GNU inetutils); it should be "hostname -i".
ds_report output
Processing dataset chunks: 100%|██████████| 106/106 [00:11<00:00, 9.45it/s]
[2024-09-05 04:11:37,288] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.15.2+c210e601, git-hash=c210e601, git-branch=master
[2024-09-05 04:11:37,288] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-05 04:11:37,288] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
hostname: invalid option -- 'I'
Try 'hostname --help' or 'hostname --usage' for more information.
Traceback (most recent call last):
File "/code/git/learnable-activations/mflow.py", line 429, in <module>
run_experiment(args)
File "/code/git/learnable-activations/mflow.py", line 384, in run_experiment
model_engine, optimizer = prepare_deepspeed_model(model, args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/code/git/learnable-activations/mflow.py", line 266, in prepare_deepspeed_model
model_engine, _, _, _ = deepspeed.initialize(
^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/__init__.py", line 144, in initialize
dist.init_distributed(dist_backend=dist_backend,
File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 673, in init_distributed
mpi_discovery(distributed_port=distributed_port, verbose=verbose)
File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 701, in mpi_discovery
result = subprocess.check_output(hostname_cmd, shell=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/subprocess.py", line 466, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' returned non-zero exit status 64.
System info (please complete the following information):
- OS: Arch Linux
- GPU count and types: 1x AMD Radeon RX 7900 XTX
- Interconnects (if applicable): none (single machine)
- Python version: 3.12
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
#!/bin/bash
export OMPI_MCA_accelerator=rocm
mpirun -np 1 --mca accelerator rocm python mflow.py --deepspeed_config ds_config.json --log_interval 100 --batch_size 4 --local_rank -1
Additional context
The offending code (in deepspeed/comm/comm.py):
master_addr = None
if rank == 0:
    hostname_cmd = ["hostname -I"]
    result = subprocess.check_output(hostname_cmd, shell=True)
    master_addr = result.decode('utf-8').split()[0]
master_addr = comm.bcast(master_addr, root=0)
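For reference, a minimal sketch of the one-flag fix the title suggests, assuming a hostname binary that accepts -i / --ip-addresses (the try/except and variable names are illustrative, not an actual patch):

```python
import subprocess

# GNU inetutils `hostname` accepts -i / --ip-addresses; only the
# net-tools build accepts -I. Using -i as the issue title suggests.
try:
    output = subprocess.check_output("hostname -i", shell=True)
    master_addr = output.decode("utf-8").split()[0]
except subprocess.CalledProcessError:
    # e.g. the host name does not resolve, or the flag is unsupported
    master_addr = None

print(master_addr)
```

Note that with shell=True the idiomatic argument is a string, not a single-element list as in the original code.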
Hi @sirus20x6 - this issue looks to be similar to this one: https://github.com/microsoft/DeepSpeed/issues/5597
Could you share the output of hostname --help and hostname -V?
here you go!
$ hostname --help
Usage: hostname [OPTION...] [NAME]
Show or set the system's host name.
-a, --aliases alias names
-d, --domain DNS domain name
-f, --fqdn, --long DNS host name or FQDN
-F, --file=FILE set host name or NIS domain name from FILE
-i, --ip-addresses addresses for the host name
-s, --short short host name
-y, --yp, --nis NIS/YP domain name
-?, --help give this help list
--usage give a short usage message
-V, --version print program version
Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.
Report bugs to <[email protected]>.
$ hostname -V
hostname (GNU inetutils) 2.5
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Debarshi Ray.
I believe the POSIX way of doing this is actually

getent hosts localhost

because net-tools, the package the hostname binary traditionally comes from, is an old, effectively deprecated package that a lot of people still have installed because they have a lot of muscle memory around those tools.
Small correction: if you just want the first field, the POSIX way of getting the loopback address is

getent hosts localhost | awk '{ print $1 }'
Thanks, @sirus20x6 - we are also looking at switching to just using socket.gethostname() and socket.gethostbyname_ex() to work around this entirely, do you think that would work for your needs?
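A rough sketch of what that socket-based approach could look like (the function name and the loopback fallback are my assumptions, not DeepSpeed's actual implementation):

```python
import socket

def detect_master_addr():
    """Resolve this host's first IP address without shelling out
    to `hostname` (sketch only; not DeepSpeed's real code)."""
    hostname = socket.gethostname()
    try:
        # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list)
        _, _, addrs = socket.gethostbyname_ex(hostname)
        return addrs[0]
    except socket.gaierror:
        # The host name does not resolve; fall back to loopback
        return "127.0.0.1"

print(detect_master_addr())
```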
I believe so. Hopefully that will be more cross-platform and resilient.
If you want, you could test with pip install git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I
I will test as soon as I get home to my machine!
It doesn't install:
> pip uninstall deepspeed
Found existing installation: deepspeed 0.15.2+c210e601
Uninstalling deepspeed-0.15.2+c210e601:
Would remove:
/thearray/git/ComfyUI/comfyvenv/bin/deepspeed
/thearray/git/ComfyUI/comfyvenv/bin/deepspeed.pt
/thearray/git/ComfyUI/comfyvenv/bin/ds
/thearray/git/ComfyUI/comfyvenv/bin/ds_bench
/thearray/git/ComfyUI/comfyvenv/bin/ds_elastic
/thearray/git/ComfyUI/comfyvenv/bin/ds_report
/thearray/git/ComfyUI/comfyvenv/bin/ds_ssh
/thearray/git/ComfyUI/comfyvenv/bin/dsr
/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed-0.15.2+c210e601.dist-info/*
/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/*
Proceed (Y/n)? y
Successfully uninstalled deepspeed-0.15.2+c210e601
(comfyvenv) (base) neuromancer :) > pip install git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I
Collecting git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I
Cloning https://github.com/microsoft/deepspeed.git (to revision loadams/update-hostname-I) to /tmp/pip-req-build-lvq7vagu
Running command git clone --filter=blob:none --quiet https://github.com/microsoft/deepspeed.git /tmp/pip-req-build-lvq7vagu
Running command git checkout -b loadams/update-hostname-I --track origin/loadams/update-hostname-I
Switched to a new branch 'loadams/update-hostname-I'
branch 'loadams/update-hostname-I' set up to track 'origin/loadams/update-hostname-I'.
Resolved https://github.com/microsoft/deepspeed.git to commit 0d2aada49e58490a5a38867b0475f4b57e12c2ae
Running command git submodule update --init --recursive -q
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [63 lines of output]
[2024-09-05 23:23:45,883] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-05 23:23:46,521] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
_torch_pytree._register_pytree_node(
/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
_torch_pytree._register_pytree_node(
DS_BUILD_OPS=0
/tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/py_ds_aio.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/py_ds_aio.cpp [skipped, no changes]
Successfully preprocessed all matching files.
Total number of unsupported CUDA function calls: 0
Total number of replaced kernel launches: 0
/tmp/pip-req-build-lvq7vagu/csrc/adam/fused_adam_frontend.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/adam/fused_adam_frontend.cpp [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/includes/compat.h -> /tmp/pip-req-build-lvq7vagu/csrc/includes/compat.h [skipped, no changes]
/tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_apply.cuh -> /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_apply_hip.cuh [ok]
/tmp/pip-req-build-lvq7vagu/csrc/includes/type_shim.h -> /tmp/pip-req-build-lvq7vagu/csrc/includes/type_shim_hip.h [ok]
/tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_adam.cu -> /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_adam.hip [ok]
Successfully preprocessed all matching files.
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-req-build-lvq7vagu/setup.py", line 198, in <module>
ext_modules.append(builder.builder())
^^^^^^^^^^^^^^^^^
File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 699, in builder
{'cxx': self.strip_empty_entries(self.cxx_args()), \
^^^^^^^^^^^^^^^
File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 842, in cxx_args
CUDA_ENABLE = self.is_cuda_enable()
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 420, in is_cuda_enable
assert_no_cuda_mismatch(self.name)
File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 86, in assert_no_cuda_mismatch
torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'
Total number of unsupported CUDA function calls: 0
Total number of replaced kernel launches: 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
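The traceback above fails because torch.version.cuda is None on ROCm builds of PyTorch (torch.version.hip is set instead), so splitting it unconditionally raises AttributeError. A hedged sketch of the kind of guard that avoids this (major_minor is an illustrative helper, not the op_builder code):

```python
def major_minor(version):
    """Return the "major.minor" prefix of a version string, or None
    when no version is available (e.g. torch.version.cuda is None
    on ROCm builds of PyTorch)."""
    if version is None:
        return None
    return ".".join(version.split(".")[:2])

print(major_minor("12.1.105"))  # -> 12.1
print(major_minor(None))        # -> None
```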
+1 for the socket approach
Replacing the subprocess call here in deepspeed/comm/comm.py#L700-L702 with
import socket
master_addr = socket.gethostbyaddr(socket.gethostname())[0]
has been working for me on internal systems
+ import socket
- master_addr = None
  if rank == 0:
-     hostname_cmd = ["hostname -I"]
-     result = subprocess.check_output(hostname_cmd, shell=True)
-     master_addr = result.decode('utf-8').split()[0]
+     master_addr = socket.gethostbyaddr(socket.gethostname())[0]
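One thing worth noting about this diff (my observation, not from the thread): socket.gethostbyaddr returns a (hostname, alias_list, ip_address_list) 3-tuple, so indexing [0] yields a host name rather than an IP address like the old hostname -I code produced. A quick illustration against the loopback address:

```python
import socket

# gethostbyaddr returns (canonical_hostname, alias_list, ip_address_list),
# so [0] is a name, not an address.
hostname, aliases, addrs = socket.gethostbyaddr("127.0.0.1")
print(hostname)              # typically "localhost"
print("127.0.0.1" in addrs)
```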
also see: #2837
I'd be happy to submit a PR + test further if it would be useful
@saforem2, thanks for offering to help with this. Please see our concerns here. Would appreciate your insights and PR.
This should be resolved with this PR: #6990. Let me know if you are still having issues with this.