ColossalAI
[BUG]: GPT single node multi-card training occurred NCCL Error
🐛 Describe the bug
When I run the examples/language/gpt/gemini/run_gemini.sh script in the official image hpcaitech/colossalai:0.2.5
with a single card, everything works. But when I set the GPU count to 2 by adding the following lines to the script, an NCCL error occurs:
GPUNUM=2
export CUDA_VISIBLE_DEVICES=0,1


Environment
The installed Python packages are as follows:
Package Version
---------------------- ---------------------
apex 0.1
astunparse 1.6.3
bcrypt 4.0.1
brotlipy 0.7.0
certifi 2022.12.7
cffi 1.15.0
cfgv 3.3.1
charset-normalizer 2.0.4
click 8.1.3
colorama 0.4.4
colossalai 0.2.0+torch1.12cu11.3
commonmark 0.9.1
conda 22.11.1
conda-content-trust 0+unknown
conda-package-handling 1.8.1
contexttimer 0.3.3
cryptography 36.0.0
distlib 0.3.6
fabric 2.7.1
filelock 3.9.0
flit_core 3.6.0
gast 0.4.0
huggingface-hub 0.12.1
identify 2.5.12
idna 3.3
invoke 1.7.3
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
ninja 1.11.1
nodeenv 1.7.0
numpy 1.22.3
nvidia-dali-cuda110 1.23.0
packaging 23.0
paramiko 2.12.0
pathlib2 2.3.7.post1
Pillow 9.0.1
pip 21.2.4
platformdirs 2.6.2
pluggy 1.0.0
pre-commit 2.21.0
psutil 5.9.4
pycosat 0.6.3
pycparser 2.21
Pygments 2.14.0
PyNaCl 1.5.0
pyOpenSSL 22.0.0
PySocks 1.7.1
PyYAML 6.0
regex 2022.10.31
requests 2.27.1
rich 13.0.1
ruamel.yaml 0.16.12
ruamel.yaml.clib 0.2.6
ruamel-yaml-conda 0.15.100
setuptools 61.2.0
six 1.16.0
tensornvme 0.1.0
timm 0.6.12
titans 0.0.7
tokenizers 0.13.2
toolz 0.12.0
torch 1.12.1
torchaudio 0.12.1
torchvision 0.13.1
tqdm 4.63.0
transformers 4.26.1
typing_extensions 4.4.0
urllib3 1.26.8
virtualenv 20.17.1
wheel 0.37.1
Can you try mounting /dev/shm into the container? For example, add --mount type=bind,source=/dev/shm,target=/dev/shm to the docker command.
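For anyone landing here with the same error, the suggestion above can be sketched as a full docker invocation. This is only a sketch under assumptions: the image tag is the one from the report, while `--gpus all`, `--rm -it`, the trailing `bash`, and the `RUN` switch are illustrative choices that do not come from this thread. The likely reason the mount helps is that NCCL uses shared memory for communication between GPUs on the same node, and a container's default /dev/shm is small.

```shell
#!/usr/bin/env bash
# Sketch: start the container with the host's /dev/shm bind-mounted.
set -euo pipefail

IMAGE="hpcaitech/colossalai:0.2.5"   # image named in the report

docker_args=(
    run --rm -it                     # assumption: interactive, throwaway container
    --gpus all                       # assumption: expose all GPUs to the container
    --mount type=bind,source=/dev/shm,target=/dev/shm   # the suggested fix
    "$IMAGE"
    bash
)

# RUN is a hypothetical switch for this sketch: by default the command
# is only printed so it can be reviewed before it is executed.
if [ "${RUN:-0}" = "1" ]; then
    docker "${docker_args[@]}"
else
    echo "docker ${docker_args[*]}"
fi
```

Running the script with RUN=1 starts the container. Inside it, `df -h /dev/shm` should then report the host's shared-memory size rather than the container default.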
Doing this works, thanks for your reply.