[BUG]: MoE example ProcessGroupNCCL cleanup error when training finishes and CUDA is shutting down
🐛 Describe the bug
I ran into an error when training the MoE example (https://github.com/hpcaitech/ColossalAI-Examples/tree/5b23e8cf22cf029b9ac77c2ed92bbc339e7fbd4e/image/moe). Each time, upon finishing the last iteration, it threw the following errors while CUDA was shutting down:
[Epoch 99 / Test]: 100%|████████████████████████████████████████| 20/20 [00:03<00:00, 5.12it/s, accuracy=0.88235, loss=0.847, throughput=3424.7 sample_per_sec]
[08/09/22 19:03:05] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.8/site-packages/colossalai/trainer/hooks/_log_hook.py:104 after_test_epoch
INFO colossalai - colossalai - INFO: [Epoch 99 / Test]: Accuracy = 0.8823 | Loss = 0.79231 | Throughput = 3403.1
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f1817bb01bd in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f1855a3f6ea in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f1855a41cd0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f1855a42f65 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4:
Environment
CUDA Version: 11.3
PyTorch Version: 1.11.0
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓
Python: 3.8.12
colossalai: 0.1.8+torch1.11cu11.3
energonai: 0.0.1b0
Hi @nostalgicimp,
I just can't reproduce the error mentioned above. Try adding a barrier at the end of the code. Do you encounter the problem every time? Please tell me how to reproduce the error.
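A minimal sketch of this barrier suggestion (the main() wrapper is illustrative and the training body is elided; it assumes the default process group has already been initialized, e.g. by colossalai.launch_from_torch inside the example's train.py):

import torch.distributed as dist

def main():
    # ... existing MoE example code: colossalai.launch_from_torch(...),
    # build the model/optimizer/trainer and call trainer.fit(...) ...

    # Synchronize all ranks before returning, so no rank begins CUDA/driver
    # shutdown while another rank still has pending NCCL work.
    dist.barrier()

if __name__ == "__main__":
    main()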
Hi @1SAA, yes, it happens every time, even if I change the config to a small number of epochs (which finishes quicker and shows the error). To reproduce the problem, please see the info below.
cmd: torchrun --nnodes=1 --nproc_per_node=2 train.py --config ./config.py
config.py:
BATCH_SIZE = 512
LEARNING_RATE = 2e-3
WEIGHT_DECAY = 3e-2
NUM_EPOCHS = 8
WARMUP_EPOCHS = 5
parallel = dict()
max_ep_size = 1  # all experts are replicated in the case that user only has 1 GPU
clip_grad_norm = 1.0  # enable gradient clipping and set it to 1.0
LOG_PATH = f"./cifar10_moe"
Error logs:
[Epoch 7 / Test]: 100%|████████████████████████████████████████| 20/20 [00:04<00:00, 4.01it/s, accuracy=0.56985, loss=1.47, throughput=2511.2 sample_per_sec]
[08/12/22 05:43:50] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.8/site-packages/colossalai/trainer/hooks/_log_hook.py:104 after_test_epoch
INFO colossalai - colossalai - INFO: [Epoch 7 / Test]: Accuracy = 0.6114 | Loss = 1.3514 | Throughput = 2496.8
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f86121a21bd in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f86500316ea in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f8650033cd0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f8650034f65 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 45837) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in
train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2022-08-12_05:43:58
  host       : colossalai-2gpu-pod
  rank       : 0 (local_rank: 0)
  exitcode   : -6 (pid: 45837)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 45837
Hi @nostalgicimp,
I still can't reproduce the problem; everything seems to run fine on my server. You could set the environment variable CUDA_LAUNCH_BLOCKING=1 for more information about your error, or add torch.distributed.destroy_process_group() at the end of the training script.
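A minimal sketch of the second suggestion (again, the main() wrapper is illustrative and the training body is elided):

import torch.distributed as dist

def main():
    # ... existing MoE example training code ...

    # Explicitly tear down the default (NCCL) process group before the
    # interpreter exits, so ProcessGroupNCCL's cleanup thread stops polling
    # CUDA events before the driver starts shutting down.
    if dist.is_initialized():
        dist.destroy_process_group()

if __name__ == "__main__":
    main()

For the extra debugging information, the same reproduction command can be run with the variable set, e.g.:
CUDA_LAUNCH_BLOCKING=1 torchrun --nnodes=1 --nproc_per_node=2 train.py --config ./config.py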
We have updated the codebase a lot since this was reported. This issue was closed due to inactivity. Thanks.