YaLM-100B
Timeout on 8 x RTX A6000
Thank you for making your work publicly available!
I am trying to run your model on 8 × RTX A6000 cards, and I'm getting a timeout error:
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805074 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805088 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805091 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805093 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805099 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805104 milliseconds before timing out.
>> Loading layer_00-model_00-model_states.pt on CPU [mp 06 / 8]
>> Loading layer_00-model_00-model_states.pt on CPU [mp 05 / 8]
>> Loading layer_00-model_00-model_states.pt on CPU [mp 03 / 8]
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805221 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805235 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805074 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805091 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805093 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805104 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805088 milliseconds before timing out.
>> Loading layer_00-model_00-model_states.pt on CPU [mp 04 / 8]
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805099 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805235 milliseconds before timing out.
> Start loading from release checkpoint from folder yalm100b_checkpoint/weights
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805221 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1824 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1827 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1828 closing signal SIGTERM
What causes this error, and how can I work around it?
Okay, I've figured out a solution, but not the underlying cause.
Setting NCCL_P2P_DISABLE=1
fixes the timeout, and the model runs perfectly well. I suspect it has something to do with the NVLink topology of the cards, but I am not sure.
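For reference, here is a minimal sketch of the workaround, assuming the model is launched with torchrun. The init_process_group call and the two-hour timeout are illustrative, not the repo's own launch path; since the YaLM-100B scripts create the process group themselves, in practice it is enough to export NCCL_P2P_DISABLE=1 in the shell before invoking them.

```python
import datetime
import os

import torch
import torch.distributed as dist

# Workaround: disable NCCL peer-to-peer (NVLink / PCIe P2P) transfers so
# collectives fall back to copies through host memory. This must be set
# before the NCCL process group is created.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Optional fallback: raise the watchdog timeout above the default 30 minutes,
# in case loading the 100B checkpoint legitimately keeps some ranks waiting.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)

# Bind each rank to its GPU; torchrun exports LOCAL_RANK for every worker.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```

Exporting NCCL_P2P_DISABLE=1 in the environment before running the repo's launch script has the same effect as the first line above.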
@andrewerf Can you provide the code you used to test it? I am trying to load the model and test it, but I'm getting a lot of errors.