ColossalAI
[BUG]: The IPv6 network addresses of (gpu2, 37615) cannot be retrieved (gai error: -2 - Name or service not known)
Describe the bug
When I ran the GPT example in the official Docker image hpcaitech/colossalai:0.2.5, the error occurred:
Environment
No response
Can you share your environment settings and try adding --network=host to your training command?
When I created the container from the hpcaitech/colossalai:0.2.5 image, I had already set --network=host, like this:
docker run -it -u root --network=host \
--name colossal_llm \
--runtime=nvidia \
-v /mnt/data/:/mnt/data/ \
hpcaitech/colossalai:0.2.5 \
/bin/bash
The Python packages are as follows:
Package Version
---------------------- ---------------------
apex 0.1
astunparse 1.6.3
bcrypt 4.0.1
brotlipy 0.7.0
certifi 2022.12.7
cffi 1.15.0
cfgv 3.3.1
charset-normalizer 2.0.4
click 8.1.3
colorama 0.4.4
colossalai 0.2.0+torch1.12cu11.3
commonmark 0.9.1
conda 22.11.1
conda-content-trust 0+unknown
conda-package-handling 1.8.1
contexttimer 0.3.3
cryptography 36.0.0
distlib 0.3.6
fabric 2.7.1
filelock 3.9.0
flit_core 3.6.0
gast 0.4.0
huggingface-hub 0.12.1
identify 2.5.12
idna 3.3
invoke 1.7.3
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
ninja 1.11.1
nodeenv 1.7.0
numpy 1.22.3
nvidia-dali-cuda110 1.23.0
packaging 23.0
paramiko 2.12.0
pathlib2 2.3.7.post1
Pillow 9.0.1
pip 21.2.4
platformdirs 2.6.2
pluggy 1.0.0
pre-commit 2.21.0
psutil 5.9.4
pycosat 0.6.3
pycparser 2.21
Pygments 2.14.0
PyNaCl 1.5.0
pyOpenSSL 22.0.0
PySocks 1.7.1
PyYAML 6.0
regex 2022.10.31
requests 2.27.1
rich 13.0.1
ruamel.yaml 0.16.12
ruamel.yaml.clib 0.2.6
ruamel-yaml-conda 0.15.100
setuptools 61.2.0
six 1.16.0
tensornvme 0.1.0
timm 0.6.12
titans 0.0.7
tokenizers 0.13.2
toolz 0.12.0
torch 1.12.1
torchaudio 0.12.1
torchvision 0.13.1
tqdm 4.63.0
transformers 4.26.1
typing_extensions 4.4.0
urllib3 1.26.8
virtualenv 20.17.1
wheel 0.37.1
The OS is as follows:
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
Hi, can you remove --runtime=nvidia and try again? And take a look at this post.
Actually I cannot replicate your issue. Would you try as per this?
Removing --network=host will solve your problem. @tianxin1860
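The error in the title is a hostname lookup failure: "gai error: -2" is the code `getaddrinfo` returns on Linux/glibc when a name is not known. A minimal sketch for checking, from inside the container, whether a hostname resolves is shown below. The helper name `can_resolve` and its `resolver` parameter are hypothetical and not part of ColossalAI or PyTorch; only the standard-library `socket` module is used.

```python
import socket


def can_resolve(host, port, resolver=socket.getaddrinfo):
    """Return (True, sorted IP list) if host resolves, else (False, message).

    `resolver` is an illustrative hook so the lookup can be swapped out;
    by default it is the standard-library socket.getaddrinfo.
    """
    try:
        infos = resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # exc.errno carries the gai error code reported in the log line.
        return False, f"gai error: {exc.errno} - {exc.strerror}"
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved IP address.
    return True, sorted({info[4][0] for info in infos})
```

Calling `can_resolve("gpu2", 37615)` with the host and port from the error message, once with and once without `--network=host`, would show whether the name is resolvable in each setup.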
Thanks! @codender
This issue was closed due to inactivity. Thanks.