[BUG]: The IPv6 network addresses of (gpu2, 37615) cannot be retrieved (gai error: -2 - Name or service not known)

Open tianxin1860 opened this issue 2 years ago • 8 comments

πŸ› Describe the bug

When I run the GPT example in the official Docker image hpcaitech/colossalai:0.2.5, the following error occurs:

image
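
The "gai error: -2" in the title is what getaddrinfo reports when the hostname (here gpu2) cannot be resolved. A minimal sketch for checking name resolution from inside the container, assuming only Python's standard socket module; the helper name resolve is hypothetical, and gpu2 and 37615 are taken from the error message:

```python
import socket


def resolve(host, port):
    """Return the sorted unique addresses getaddrinfo reports for host, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as err:
        # err.errno holds the gai code; on Linux/glibc, -2 is EAI_NONAME
        # ("Name or service not known"), the same code as in the issue title.
        print(f"cannot resolve {host!r}: gai error {err.errno} - {err.strerror}")
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

Calling resolve("gpu2", 37615) or resolve(socket.gethostname(), 37615) inside the container should show whether the name resolves there at all. An empty list would suggest the problem lies in the container's name resolution (for example its /etc/hosts or DNS settings) rather than in the training script.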

Environment

No response

tianxin1860 avatar Mar 01 '23 15:03 tianxin1860

Can you share your environment settings and try adding --network=host to your training command?

JThh avatar Mar 02 '23 13:03 JThh

Can you share your environment settings and try adding --network=host to your training command?

When I created the container from the hpcaitech/colossalai:0.2.5 image, I set --network=host, like this:

docker run -it -u root --network=host \
--name colossal_llm \
--runtime=nvidia \
-v /mnt/data/:/mnt/data/ \
hpcaitech/colossalai:0.2.5 \
/bin/bash

tianxin1860 avatar Mar 03 '23 01:03 tianxin1860

Can you share your environment settings and try adding --network=host to your training command?

The Python packages are as follows:

Package                Version
---------------------- ---------------------
apex                   0.1
astunparse             1.6.3
bcrypt                 4.0.1
brotlipy               0.7.0
certifi                2022.12.7
cffi                   1.15.0
cfgv                   3.3.1
charset-normalizer     2.0.4
click                  8.1.3
colorama               0.4.4
colossalai             0.2.0+torch1.12cu11.3
commonmark             0.9.1
conda                  22.11.1
conda-content-trust    0+unknown
conda-package-handling 1.8.1
contexttimer           0.3.3
cryptography           36.0.0
distlib                0.3.6
fabric                 2.7.1
filelock               3.9.0
flit_core              3.6.0
gast                   0.4.0
huggingface-hub        0.12.1
identify               2.5.12
idna                   3.3
invoke                 1.7.3
mkl-fft                1.3.1
mkl-random             1.2.2
mkl-service            2.4.0
ninja                  1.11.1
nodeenv                1.7.0
numpy                  1.22.3
nvidia-dali-cuda110    1.23.0
packaging              23.0
paramiko               2.12.0
pathlib2               2.3.7.post1
Pillow                 9.0.1
pip                    21.2.4
platformdirs           2.6.2
pluggy                 1.0.0
pre-commit             2.21.0
psutil                 5.9.4
pycosat                0.6.3
pycparser              2.21
Pygments               2.14.0
PyNaCl                 1.5.0
pyOpenSSL              22.0.0
PySocks                1.7.1
PyYAML                 6.0
regex                  2022.10.31
requests               2.27.1
rich                   13.0.1
ruamel.yaml            0.16.12
ruamel.yaml.clib       0.2.6
ruamel-yaml-conda      0.15.100
setuptools             61.2.0
six                    1.16.0
tensornvme             0.1.0
timm                   0.6.12
titans                 0.0.7
tokenizers             0.13.2
toolz                  0.12.0
torch                  1.12.1
torchaudio             0.12.1
torchvision            0.13.1
tqdm                   4.63.0
transformers           4.26.1
typing_extensions      4.4.0
urllib3                1.26.8
virtualenv             20.17.1
wheel                  0.37.1

tianxin1860 avatar Mar 03 '23 01:03 tianxin1860

The OS information is as follows:

NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

tianxin1860 avatar Mar 03 '23 01:03 tianxin1860

Hi, can you remove --runtime=nvidia and try again? And take a look at this post.

JThh avatar Mar 03 '23 03:03 JThh

Actually I cannot replicate your issue. Would you try as per this?

JThh avatar Mar 03 '23 03:03 JThh

Removing --network=host will solve your problem. @tianxin1860

codender avatar Mar 13 '23 06:03 codender

Thanks, @codender! This issue was closed due to inactivity.

binmakeswell avatar Apr 27 '23 08:04 binmakeswell