ColossalAI
[BUG]: RuntimeError: CUDA error: unknown error
🐛 Describe the bug
[04/17/23 20:35:20] INFO     colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0
[04/17/23 20:35:22] INFO     colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:558 set_seed
INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
INFO     colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/initialize.py:115 launch
INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Files already downloaded and verified
Files already downloaded and verified
[04/17/23 20:35:27] INFO     colossalai - ProcessGroup - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/tensor/process_group.py:22 log_pg_init
INFO     colossalai - ProcessGroup - INFO: Pytorch ProcessGroup Init:
backend: nccl
ranks: [0]
[extension] OP colossalai._C.cpu_adam has been compileed ahead of time, skip building.
[extension] OP colossalai._C.fused_optim has been compileed ahead of time, skip building.
[04/17/23 20:35:28] INFO     colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py:251 _partition_param_list
INFO     colossalai - colossalai - INFO: Number of elements on ranks: [23712932]
##################
Traceback (most recent call last):
File "/mnt/d/CIFAR100_timm/train_fgsm_colossai.py", line 236, in 
Environment
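1060 6 GB, i7-8750H, WSL2, Ubuntu 20.04. This is not a bug in the code: the code was just written and runs fine, but after running for a while in this environment it fails with an unknown error.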
You may try decreasing the batch size (if it is not already 1). This might be caused by CUDA running out of memory (OOM).
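One way to try the batch-size suggestion systematically is to halve the batch size until a training step succeeds. The following is only a minimal sketch under stated assumptions: `find_working_batch_size` and `run_step` are hypothetical names, not part of ColossalAI or PyTorch, and `run_step` stands in for whatever function runs one training step of your script at a given batch size.

```python
def find_working_batch_size(run_step, start_batch_size, min_batch_size=1):
    """Halve the batch size until run_step(batch_size) succeeds.

    run_step is a hypothetical user-supplied callable that runs one
    training step and raises RuntimeError on a CUDA failure.
    """
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            message = str(err).lower()
            # Retry only on errors that look memory- or CUDA-related;
            # anything else is re-raised unchanged.
            if "out of memory" not in message and "cuda error" not in message:
                raise
            print(f"batch size {batch_size} failed: {err}")
            batch_size //= 2
    raise RuntimeError("no batch size down to the minimum succeeded")
```

Note that after a `CUDA error: unknown error` the CUDA context may no longer be usable in the same process, so in practice each attempt may need a fresh process rather than an in-process retry.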