
DDP training fails with an error; single-GPU training works fine

Open gongliuqing321 opened this issue 1 year ago • 1 comment

[Image: dingtalkgov_qt_clipbord_pic_2]

gongliuqing321 avatar Jul 30 '24 06:07 gongliuqing321

Hello, thanks for your attention to our work. This looks like a communication issue between GPUs in the cluster, which might be caused by network latency or load imbalance between GPUs. Could you check your CUDA and torch versions? Upgrading them may help. Another option is to increase the timeout limit NCCL_TIMEOUT_MS.
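
For reference, here is a minimal sketch of both suggestions: printing the torch and CUDA versions, and raising the collective timeout. It is not taken from the PAGCP training script. It assumes the job is launched with `torchrun` (which sets `LOCAL_RANK`) and that the process group is created with `torch.distributed.init_process_group`, whose `timeout` argument is one way to raise the limit in PyTorch. The environment variable `DDP_TIMEOUT_MINUTES` and the helper names `ddp_timeout_from_env` and `init_ddp` are hypothetical, introduced only for illustration.

```python
import os
from datetime import timedelta


def ddp_timeout_from_env(default_minutes: int = 30) -> timedelta:
    """Build the collective timeout from DDP_TIMEOUT_MINUTES.

    DDP_TIMEOUT_MINUTES is a hypothetical variable name used only in this
    sketch; it is not read by PyTorch or NCCL themselves.
    """
    raw = os.environ.get("DDP_TIMEOUT_MINUTES")
    minutes = int(raw) if raw else default_minutes
    if minutes <= 0:
        raise ValueError("timeout must be a positive number of minutes")
    return timedelta(minutes=minutes)


def init_ddp(timeout: timedelta) -> None:
    """Report versions, then create the NCCL process group with the timeout.

    Assumes a torchrun launch, which provides LOCAL_RANK and the rendezvous
    variables that init_process_group reads by default.
    """
    # Imported lazily so the helper above works without torch installed.
    import torch
    import torch.distributed as dist

    # The versions asked about in the reply above.
    print(f"torch {torch.__version__}, CUDA {torch.version.cuda}")

    # A longer timeout gives slow or unevenly loaded ranks more time
    # before a collective operation is reported as failed.
    dist.init_process_group(backend="nccl", timeout=timeout)
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```

In a training script, `init_ddp(ddp_timeout_from_env())` would replace the existing process-group setup. If the repository already calls `init_process_group`, passing `timeout=` to that existing call is enough.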

HankYe avatar Aug 24 '24 01:08 HankYe