ColossalAI

[BUG]: Regarding the supervised instructs tuning for Coati

Open mynamedaike opened this issue 1 year ago • 3 comments

🐛 Describe the bug

I ran the supervised instruction tuning training command for Coati, following the instructions in README.md. It raised an NCCL-related error, shown below. Can anyone help me with this issue? Thanks in advance.

image

Environment

python==3.9.16
pytorch==1.13.1 build:py3.9_cuda11.7_cudnn8.5.0_0
pytorch-cuda==11.7
cuda-cudart==11.7.99
cuda-cupti==11.7.101
cuda-libraries==11.7.1
cuda-nvcc==12.1.66
cuda-nvcc_linux-64==11.7.1
cuda-nvrtc==11.7.99
cuda-nvtx==11.7.91
cuda-runtime==11.7.1
cudatoolkit==11.7.0
nccl==2.14.3.1

image image image

mynamedaike • Apr 09 '23 03:04

Hi, are your GPU cards on the same machine?

JThh • Apr 09 '23 20:04

@JThh Yes. All the GPU cards are on the same machine.

image

mynamedaike • Apr 11 '23 05:04

Hi @mynamedaike, the error "Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels" can occur when you load or tokenize a large dataset for the first time and that step takes longer than the preset timeout. You could try passing a timeout argument to the Accelerator constructor; the default is 1800 seconds.
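For reference, here is a minimal sketch of what raising the timeout might look like. It assumes the Accelerator in question is the one from Hugging Face Accelerate, where, as far as I know, the timeout is supplied through an `InitProcessGroupKwargs` handler rather than as a direct constructor argument. The helper names `pick_timeout` and `build_accelerator` are hypothetical and only for illustration; whether your Coati training script actually builds its process group this way depends on the version you are running.

```python
from datetime import timedelta

# Default timeout in seconds, as mentioned in the comment above.
DEFAULT_TIMEOUT_SECONDS = 1800


def pick_timeout(expected_preprocessing_seconds: float,
                 safety_factor: float = 2.0) -> timedelta:
    """Hypothetical helper: pick a timeout that is never below the default
    and leaves headroom over the expected load/tokenize time."""
    seconds = max(DEFAULT_TIMEOUT_SECONDS,
                  expected_preprocessing_seconds * safety_factor)
    return timedelta(seconds=seconds)


def build_accelerator(timeout: timedelta):
    """Hypothetical helper: create an Accelerator whose process group
    is initialized with the given timeout."""
    # Imported lazily so pick_timeout works without accelerate installed.
    from accelerate import Accelerator, InitProcessGroupKwargs

    # Assumption: the handler's timeout is forwarded to
    # torch.distributed.init_process_group when the process group is created.
    process_group_kwargs = InitProcessGroupKwargs(timeout=timeout)
    return Accelerator(kwargs_handlers=[process_group_kwargs])


# Usage inside a training script launched with multiple processes:
#   timeout = pick_timeout(expected_preprocessing_seconds=3600)
#   accelerator = build_accelerator(timeout)
```

If tokenization is the slow step, caching the tokenized dataset once before launching the distributed job is another way to stay under the timeout without changing it.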

Camille7777 • Apr 17 '23 08:04