NCCL Error 1: unhandled cuda error

Open ShuJackson opened this issue 4 years ago • 3 comments

When I run the training script (./run.sh), it aborts with an instance of 'std::runtime_error' what(): NCCL Error 1: unhandled cuda error

This happens every time in the Evaluation step of the train.py script - after the 'convert squad examples to features' step completes successfully and right after 'Evaluating: 0%' is printed.

I have made sure torch can pick up the cuda info:

print(torch.cuda.is_available())  # True
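A `True` from `torch.cuda.is_available()` only shows that PyTorch can see a driver and at least one device; it does not by itself show that multi-GPU communication through NCCL works. A slightly broader check might look like the sketch below. The helper name `collect_cuda_diagnostics` is hypothetical (it is not part of the repository), and the function only reports what PyTorch sees rather than diagnosing the error.

```python
import importlib.util


def collect_cuda_diagnostics():
    """Return a dict of basic CUDA facts as seen by PyTorch (hypothetical helper)."""
    # Guard so the sketch still runs on a machine without PyTorch installed.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "cuda_build_version": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    # Names of the GPUs visible to this process (empty list if there are none).
    info["devices"] = [
        torch.cuda.get_device_name(i) for i in range(info["device_count"])
    ]
    return info


if __name__ == "__main__":
    for key, value in collect_cuda_diagnostics().items():
        print(f"{key}: {value}")
```

Comparing `cuda_build_version` with the CUDA version supported by the installed driver is one way to spot a mismatch, which is a common reason to update drivers.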

ShuJackson avatar Jun 10 '21 17:06 ShuJackson

@TheAtticusProject

ShuJackson avatar Jun 10 '21 17:06 ShuJackson

This is a very low-level issue, and unfortunately "NCCL Error 1: unhandled cuda error" means that even CUDA does not know what it is. I could only suggest updating drivers or seeing if there is a more detailed error log, but even then this would be a CUDA or hardware issue.
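One way to look for a more detailed error log, as suggested above, is to rerun the script with NCCL's own logging turned on. The sketch below is an assumption about how that could be done, not something from the repository: `NCCL_DEBUG` and `CUDA_LAUNCH_BLOCKING` are standard NCCL and CUDA environment variables, while the helper names `nccl_debug_env` and `rerun_with_debug` are hypothetical.

```python
import os
import subprocess


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with verbose NCCL/CUDA settings added."""
    env = dict(os.environ if base_env is None else base_env)
    # Ask NCCL to log initialization details and the failing call.
    env["NCCL_DEBUG"] = "INFO"
    # Make CUDA kernel launches synchronous so errors surface at the call site.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_with_debug(command=("./run.sh",)):
    """Rerun the training command with the debug environment (hypothetical helper)."""
    # check=False: the run is expected to fail; the goal is the extra log output.
    return subprocess.run(list(command), env=nccl_debug_env(), check=False)


if __name__ == "__main__":
    result = rerun_with_debug()
    print("exit code:", result.returncode)
```

The extra NCCL output usually names the underlying CUDA failure, which helps separate a driver or version mismatch from a hardware fault.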

hendrycks avatar Jun 11 '21 16:06 hendrycks

How do I run the script? Which files need to be modified, and how do I execute the code? Could you give me some pointers?

Mei0211 avatar Feb 16 '22 01:02 Mei0211