llm-foundry

timeout error

Open NarenZen opened this issue 2 years ago • 3 comments

I got the error below when finetuning with mpt-7b_dolly_sft.yaml.

Dataset: mosaicml/dolly_hhrlhf

[E ProcessGroupNCCL.cpp:828] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=600000) ran for 606142 milliseconds before timing out.
/usr/local/lib/python3.10/dist-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: amp_bf16; MFU cannot be calculated and reported. gpu_flops_available can be manuallyoverridden by setting gpu_flops_available in SpeedMonitor.
  warnings.warn(
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=600000) ran for 606142 milliseconds before timing out.

NarenZen avatar May 26 '23 16:05 NarenZen

I tried passing dist_timeout=1200.0, but the timeout still happens.
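One thing that may be worth checking is whether the new timeout actually reached the process group. The watchdog line prints the timeout in effect, so it can be compared against the configured value. Below is a minimal sketch, assuming the log format matches the watchdog line quoted in the first comment. The helper names `parse_watchdog_line` and `timeout_was_applied` are hypothetical and not part of llm-foundry or Composer, and the sample line is the one from the original report, captured before dist_timeout was changed.

```python
import re

# Regex for the watchdog line format quoted in this thread (an assumption:
# other PyTorch versions may word this message differently).
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Hypothetical helper: extract rank, op, and timings from a watchdog line."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "rank": int(fields["rank"]),
        "seq": int(fields["seq"]),
        "op": fields["op"],
        "timeout_s": int(fields["timeout_ms"]) / 1000.0,
        "elapsed_s": int(fields["elapsed_ms"]) / 1000.0,
    }


def timeout_was_applied(parsed, configured_timeout_s):
    """Hypothetical helper: does the logged timeout equal the configured one?"""
    return parsed["timeout_s"] == configured_timeout_s


# Watchdog line copied from the original report (first run, default timeout).
line = (
    "[E ProcessGroupNCCL.cpp:828] [Rank 7] Watchdog caught collective operation "
    "timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=600000) ran for "
    "606142 milliseconds before timing out."
)

parsed = parse_watchdog_line(line)
print(parsed)
print("matches dist_timeout=1200.0:", timeout_was_applied(parsed, 1200.0))
```

If a log from the rerun still shows Timeout(ms)=600000, the setting probably did not take effect. If it shows 1200000 and the job still hangs at SeqNum=1, the first collective is never completing, and a longer timeout alone is unlikely to help.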

NarenZen avatar May 26 '23 17:05 NarenZen

I encountered the same problem. Could you tell me how you resolved it?

yqli2420 avatar May 31 '23 10:05 yqli2420

I have the same error.

j-Gaow avatar Jun 12 '23 09:06 j-Gaow

Hi, there is not enough information here to debug. NCCL timeouts can happen for a variety of reasons. I am going to close this issue as stale. Please try again on main, and open a new issue if you are still encountering problems.
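For anyone opening a new issue, more verbose logs usually help narrow down which rank or collective is stuck. Here is a rough sketch of building an environment with the standard NCCL and PyTorch debug variables turned on. `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are real environment variables; the helper `with_debug_env` is a hypothetical name used only for illustration.

```python
import os

# Debug variables read by NCCL and torch.distributed respectively.
DEBUG_ENV = {
    "NCCL_DEBUG": "INFO",
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
}


def with_debug_env(base=None):
    """Hypothetical helper: copy an environment and add debug defaults.

    Values already present in the base environment are left untouched.
    """
    env = dict(os.environ if base is None else base)
    for key, value in DEBUG_ENV.items():
        env.setdefault(key, value)
    return env


env = with_debug_env({})
for key in sorted(env):
    print(f"{key}={env[key]}")
```

The resulting dict can be passed as the env argument to subprocess.run for whatever launch command you normally use, and the extra output can be attached to the new issue.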

dakinggg avatar Sep 07 '23 02:09 dakinggg