Kefeng Ning
Results: 2 comments by Kefeng Ning
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL
Solved by setting Timeout to 6000_000 in distributed.py.
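For anyone hitting the same watchdog timeout: the comment does not show the change itself, so this is only a sketch of what raising the timeout might look like, assuming distributed.py sets up the process group through torch.distributed.init_process_group, which accepts a datetime.timedelta for its timeout argument. The helper names build_init_kwargs and init_distributed are hypothetical, and reading 6000_000 as seconds is an assumption, since the comment gives no unit.

```python
from datetime import timedelta


def build_init_kwargs(timeout_seconds: int, backend: str = "nccl") -> dict:
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper; timeout_seconds is assumed to be in seconds.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return {
        "backend": backend,
        # init_process_group expects the timeout as a timedelta.
        "timeout": timedelta(seconds=timeout_seconds),
    }


def init_distributed(timeout_seconds: int) -> None:
    """Initialise the default process group with a longer collective timeout."""
    # Imported lazily so build_init_kwargs can be used without PyTorch installed.
    import torch.distributed as dist

    # Relies on the launcher (e.g. torchrun) having set RANK, WORLD_SIZE,
    # MASTER_ADDR and MASTER_PORT in the environment.
    dist.init_process_group(**build_init_kwargs(timeout_seconds))
```

A longer timeout only stops the watchdog from aborting a slow collective. If one rank is genuinely stuck (for example, ranks reaching different collectives), the job will hang for the full timeout instead of failing early, so it is worth checking why the operation takes that long.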
Thank you for the excellent work. What is the progress on this issue?