[BUG]: MoE example ProcessGroupNCCL cleanup error when training finishes and CUDA is shutting down
🐛 Describe the bug
I ran into an error when training the MoE example (https://github.com/hpcaitech/ColossalAI-Examples/tree/5b23e8cf22cf029b9ac77c2ed92bbc339e7fbd4e/image/moe). Each time, upon finishing the last iteration, it threw the following errors while CUDA was shutting down:
[Epoch 99 / Test]: 100%|████████████████████████████████████████| 20/20 [00:03<00:00, 5.12it/s, accuracy=0.88235, loss=0.847, throughput=3424.7 sample_per_sec]
[08/09/22 19:03:05] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.8/site-packages/colossalai/trainer/hooks/_log_hook.py:104 after_test_epoch
INFO colossalai - colossalai - INFO: [Epoch 99 / Test]: Accuracy = 0.8823 | Loss = 0.79231 | Throughput = 3403.1
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f1817bb01bd in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f1855a3f6ea in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f1855a41cd0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f1855a42f65 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4:
Environment
CUDA Version: 11.3
PyTorch Version: 1.11.0
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓
Python: 3.8.12
colossalai: 0.1.8+torch1.11cu11.3
energonai: 0.0.1b0
Hi @nostalgicimp,
I just can't reproduce the error mentioned above. Try adding a barrier at the end of the code. Do you encounter the problem every time? Please tell me how to reproduce the error.
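A minimal sketch of this barrier suggestion (the main() wrapper is illustrative and the training body is elided; it assumes the default process group has already been initialized, e.g. by colossalai.launch_from_torch inside the example's train.py):

import torch.distributed as dist

def main():
    # ... existing MoE example code: colossalai.launch_from_torch(...),
    # build the model/optimizer/trainer and call trainer.fit(...) ...

    # Synchronize all ranks before returning, so no rank begins CUDA/driver
    # shutdown while another rank still has pending NCCL work.
    dist.barrier()

if __name__ == "__main__":
    main()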
Hi @1SAA, yes, it happens every time, even if I change the config to a small number of epochs (which finishes quicker and shows the error). To reproduce the problem, please see the info below.
cmd: torchrun --nnodes=1 --nproc_per_node=2 train.py --config ./config.py
config.py:
BATCH_SIZE = 512
LEARNING_RATE = 2e-3
WEIGHT_DECAY = 3e-2
NUM_EPOCHS = 8
WARMUP_EPOCHS = 5
parallel = dict()
max_ep_size = 1  # all experts are replicated in the case that user only has 1 GPU
clip_grad_norm = 1.0  # enable gradient clipping and set it to 1.0
LOG_PATH = f"./cifar10_moe"
Error logs:
[Epoch 7 / Test]: 100%|████████████████████████████████████████| 20/20 [00:04<00:00, 4.01it/s, accuracy=0.56985, loss=1.47, throughput=2511.2 sample_per_sec]
[08/12/22 05:43:50] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.8/site-packages/colossalai/trainer/hooks/_log_hook.py:104 after_test_epoch
INFO colossalai - colossalai - INFO: [Epoch 7 / Test]: Accuracy = 0.6114 | Loss = 1.3514 | Throughput = 2496.8
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f86121a21bd in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f86500316ea in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f8650033cd0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f8650034f65 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 45837) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in
train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2022-08-12_05:43:58
  host       : colossalai-2gpu-pod
  rank       : 0 (local_rank: 0)
  exitcode   : -6 (pid: 45837)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 45837
Hi @nostalgicimp,
I still can't reproduce the problem; everything seems to run fine on my server. You could set the environment variable CUDA_LAUNCH_BLOCKING=1 for more information about your error, or add torch.distributed.destroy_process_group() at the end of the training script.
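A minimal sketch of the second suggestion (again, the main() wrapper is illustrative and the training body is elided):

import torch.distributed as dist

def main():
    # ... existing MoE example training code ...

    # Explicitly tear down the default (NCCL) process group before the
    # interpreter exits, so ProcessGroupNCCL's cleanup thread stops polling
    # CUDA events before the driver starts shutting down.
    if dist.is_initialized():
        dist.destroy_process_group()

if __name__ == "__main__":
    main()

For the extra debugging information, the same reproduction command can be run with the variable set, e.g.:
CUDA_LAUNCH_BLOCKING=1 torchrun --nnodes=1 --nproc_per_node=2 train.py --config ./config.py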
We have updated the codebase a lot since this was reported. This issue was closed due to inactivity. Thanks.