ColossalAI
[BUG]: Regarding the supervised instructs tuning for Coati
🐛 Describe the bug
I ran the supervised instructs tuning training command for Coati, following the instructions in README.md. It raised an NCCL-related error, as shown below. Can anyone help me with this issue? Thanks in advance.
Environment
python==3.9.16
pytorch==1.13.1 build:py3.9_cuda11.7_cudnn8.5.0_0
pytorch-cuda==11.7
cuda-cudart==11.7.99
cuda-cupti==11.7.101
cuda-libraries==11.7.1
cuda-nvcc==12.1.66
cuda-nvcc_linux-64==11.7.1
cuda-nvrtc==11.7.99
cuda-nvtx==11.7.91
cuda-runtime==11.7.1
cudatoolkit==11.7.0
nccl==2.14.3.1
Hi, are your GPU cards on the same machine?
@JThh Yes. All the GPU cards are on the same machine.
Hi @mynamedaike, the error "Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels" can occur when you load or tokenize a large dataset for the first time and that step takes longer than the preset timeout. You can try adding a timeout argument to the Accelerator constructor; the default is 1800 seconds.
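For reference, here is a minimal sketch of raising that timeout, assuming the training script builds a Hugging Face `accelerate` `Accelerator`. In `accelerate` the timeout is passed through an `InitProcessGroupKwargs` handler rather than as a direct constructor argument. The 3-hour value and the `make_accelerator` helper are illustrative assumptions, not something taken from the Coati scripts.

```python
from datetime import timedelta

# Assumption: 3 hours is an arbitrary illustrative value. Pick something longer
# than the slowest rank needs to load and tokenize the dataset.
LONG_TIMEOUT = timedelta(hours=3)


def make_accelerator(timeout=LONG_TIMEOUT):
    """Hypothetical helper: build an Accelerator with a longer collective timeout."""
    # Imported lazily so this module can be loaded without accelerate installed.
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    # The handler forwards `timeout` to the process group initialization.
    handler = InitProcessGroupKwargs(timeout=timeout)
    return Accelerator(kwargs_handlers=[handler])
```

If the script initializes the process group through `torch.distributed` directly, the `timeout` parameter of `torch.distributed.init_process_group` plays the same role.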