
[BUG]: NCCL error in single-node multi-GPU GPT training

Open tianxin1860 opened this issue 1 year ago • 2 comments

🐛 Describe the bug

When I run the examples/language/gpt/gemini/run_gemini.sh script in the official image hpcaitech/colossalai:0.2.5 on a single card, everything works. But when I set GPUNUM=2 by adding the following lines to the script, an NCCL error occurs:

GPUNUM=2
export CUDA_VISIBLE_DEVICES=0,1
(Two images were attached here.)
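Since the attached images carry the only record of the error, a text log is easier to search and quote. The following is a minimal sketch of how the failing two-GPU run could be reproduced with more diagnostics. It assumes the script picks up GPUNUM from the environment (otherwise set it inside the script as above); NCCL_DEBUG is NCCL's standard logging variable, and the launch line is left commented out because it needs the container and GPUs.

```shell
# Show how much shared memory the container has; NCCL uses /dev/shm
# for communication between GPUs on the same node.
if [ -d /dev/shm ]; then
    df -h /dev/shm
fi

# Ask NCCL to print detailed initialization and error information.
export NCCL_DEBUG=INFO

# Same two-GPU settings as in the report.
export GPUNUM=2
export CUDA_VISIBLE_DEVICES=0,1

# Launch the example from examples/language/gpt/gemini (uncomment to run):
# bash run_gemini.sh
```

If df reports a very small /dev/shm, the NCCL log will usually point at a shared-memory failure.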

Environment

The installed Python packages are as follows:

Package                Version
---------------------- ---------------------
apex                   0.1
astunparse             1.6.3
bcrypt                 4.0.1
brotlipy               0.7.0
certifi                2022.12.7
cffi                   1.15.0
cfgv                   3.3.1
charset-normalizer     2.0.4
click                  8.1.3
colorama               0.4.4
colossalai             0.2.0+torch1.12cu11.3
commonmark             0.9.1
conda                  22.11.1
conda-content-trust    0+unknown
conda-package-handling 1.8.1
contexttimer           0.3.3
cryptography           36.0.0
distlib                0.3.6
fabric                 2.7.1
filelock               3.9.0
flit_core              3.6.0
gast                   0.4.0
huggingface-hub        0.12.1
identify               2.5.12
idna                   3.3
invoke                 1.7.3
mkl-fft                1.3.1
mkl-random             1.2.2
mkl-service            2.4.0
ninja                  1.11.1
nodeenv                1.7.0
numpy                  1.22.3
nvidia-dali-cuda110    1.23.0
packaging              23.0
paramiko               2.12.0
pathlib2               2.3.7.post1
Pillow                 9.0.1
pip                    21.2.4
platformdirs           2.6.2
pluggy                 1.0.0
pre-commit             2.21.0
psutil                 5.9.4
pycosat                0.6.3
pycparser              2.21
Pygments               2.14.0
PyNaCl                 1.5.0
pyOpenSSL              22.0.0
PySocks                1.7.1
PyYAML                 6.0
regex                  2022.10.31
requests               2.27.1
rich                   13.0.1
ruamel.yaml            0.16.12
ruamel.yaml.clib       0.2.6
ruamel-yaml-conda      0.15.100
setuptools             61.2.0
six                    1.16.0
tensornvme             0.1.0
timm                   0.6.12
titans                 0.0.7
tokenizers             0.13.2
toolz                  0.12.0
torch                  1.12.1
torchaudio             0.12.1
torchvision            0.13.1
tqdm                   4.63.0
transformers           4.26.1
typing_extensions      4.4.0
urllib3                1.26.8
virtualenv             20.17.1
wheel                  0.37.1

tianxin1860, Mar 14 '23 08:03

Can you try mounting /dev/shm into the container? For example, add --mount type=bind,source=/dev/shm,target=/dev/shm to the docker command.
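For context: Docker gives a container a small private /dev/shm by default (64 MB), and NCCL relies on shared memory when several GPUs on one node talk to each other, which is the likely reason the single-card run worked while the two-card run failed. A minimal sketch of a full command using the suggested mount follows. Only the --mount option and the image tag come from this thread; the other flags (--rm, -it, --gpus all) and the RUN switch are assumptions for illustration.

```shell
# Image tag taken from the report.
IMAGE="hpcaitech/colossalai:0.2.5"

# Bind-mount the host's /dev/shm so the container is not limited to
# Docker's small default shared-memory segment.
SHM_MOUNT="type=bind,source=/dev/shm,target=/dev/shm"

# Build the command as a string so it can be reviewed first.
CMD="docker run --rm -it --gpus all --mount $SHM_MOUNT $IMAGE bash"
echo "$CMD"

# Set RUN=1 to actually start the container (hypothetical switch,
# used here only to keep the sketch safe to run anywhere).
if [ "${RUN:-0}" = "1" ]; then
    docker run --rm -it --gpus all --mount "$SHM_MOUNT" "$IMAGE" bash
fi
```

Docker's --shm-size and --ipc=host options are other common ways to give a container more shared memory, though only the bind mount was confirmed to work in this thread.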

JThh, Mar 15 '23 04:03

> Can you try mounting /dev/shm into the container? For example, add --mount type=bind,source=/dev/shm,target=/dev/shm to the docker command.

Doing this works, thanks for your reply.

tianxin1860, Mar 15 '23 14:03