
RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 6.00 GiB total capacity; 177.88 MiB already allocated; 0 bytes free; 4.71 GiB reserved in total by PyTorch)

Open haomayang1126 opened this issue 2 years ago • 7 comments

As soon as the training command starts, GPU memory fills up completely. Sometimes it fails in the first epoch, sometimes in the third. Setting num_workers to 4, 2, or 0 gives the same problem. Environment: Python 3.8, PyTorch 1.8.1, CUDA 10.1.

==============================================================================
2021-07-24 17:55:22 | INFO | yolox.core.trainer:188 - ---> start train epoch1
2021-07-24 17:55:26 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
2021-07-24 17:55:28 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
2021-07-24 17:55:30 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
2021-07-24 17:55:37 | INFO | yolox.core.trainer:237 - epoch: 1/30, iter: 10/40, mem: 4660Mb, iter_time: 1.570s, data_time: 0.867s, total_loss: 11.0, iou_loss: 3.0, l1_loss: 0.0, conf_loss: 5.7, cls_loss: 2.3, lr: 1.953e-06, size: 640, ETA: 0:31:08
2021-07-24 17:55:43 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
2021-07-24 17:55:51 | INFO | yolox.core.trainer:237 - epoch: 1/30, iter: 20/40, mem: 4660Mb, iter_time: 1.333s, data_time: 0.762s, total_loss: 10.1, iou_loss: 2.8, l1_loss: 0.0, conf_loss: 4.5, cls_loss: 2.8, lr: 7.813e-06, size: 576, ETA: 0:28:32
2021-07-24 17:55:53 | INFO | yolox.core.trainer:183 - Training of experiment is done and the best AP is 0.00
2021-07-24 17:55:53 | ERROR | yolox.core.launch:73 - An error has been caught in function 'launch', process 'MainProcess' (5488), thread 'MainThread' (6852):

Traceback (most recent call last):

  File "tools\train.py", line 111, in <module>
    launch(
    └ <function launch at 0x00000126EC829E50>

  File "g:\pythonproject\yolox-main\yolox\core\launch.py", line 73, in launch
    main_func(*args)
    │         └ (╒══════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════...
    └ <function main at 0x00000126EEA76DC0>

  File "tools\train.py", line 101, in main
    trainer.train()
    │       └ <function Trainer.train at 0x00000126EDDBCD30>
    └ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

  File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 70, in train
    self.train_in_epoch()
    │    └ <function Trainer.train_in_epoch at 0x00000126EEA44F70>
    └ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

  File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 79, in train_in_epoch
    self.train_in_iter()
    │    └ <function Trainer.train_in_iter at 0x00000126EEA55280>
    └ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

  File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 85, in train_in_iter
    self.train_one_iter()
    │    └ <function Trainer.train_one_iter at 0x00000126EEA55310>
    └ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

  File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 91, in train_one_iter
    inps, targets = self.prefetcher.next()
                    │    │          └ <function DataPrefetcher.next at 0x00000126EDDBC310>
                    │    └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
                    └ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

  File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 48, in next
    self.preload()
    │    └ <function DataPrefetcher.preload at 0x00000126EDDBC280>
    └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>

  File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 37, in preload
    self.input_cuda()
    │    └ <bound method DataPrefetcher._input_cuda_for_image of <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>>
    └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>

  File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 52, in _input_cuda_for_image
    self.next_input = self.next_input.cuda(non_blocking=True)
    │    │            │    │          └ <method 'cuda' of 'torch._C._TensorBase' objects>
    │    │            │    └ tensor([[[[ 0.1426,  0.1426,  0.1254,  ..., -0.5253, -0.5424, -0.5424], ...
    │    │            └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
    │    └ tensor([[[[ 0.1426,  0.1426,  0.1254,  ..., -0.5253, -0.5424, -0.5424], ...
    └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>

RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 6.00 GiB total capacity; 177.88 MiB already allocated; 0 bytes free; 4.71 GiB reserved in total by PyTorch)

(Swin) G:\Pythonproject\YOLOX-main>

haomayang1126 avatar Jul 24 '21 09:07 haomayang1126

It's a known error, see #91. We are working on it now.

Joker316701882 avatar Jul 25 '21 02:07 Joker316701882

Try removing the -o option from the training command.

1VeniVediVeci1 avatar Jul 25 '21 03:07 1VeniVediVeci1

Try removing the -o option from the training command.

Thanks~ that solved the problem.

haomayang1126 avatar Jul 26 '21 09:07 haomayang1126

It's a known error #91 . We are working on it now.

Deleting -o from the command works.

haomayang1126 avatar Jul 26 '21 09:07 haomayang1126
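For readers landing here: in YOLOX's `tools/train.py`, `-o` (`--occupy`) asks the trainer to pre-occupy GPU memory at startup, which can leave nothing free on a small card. A sketch of the fix the thread converged on (the exp file path, device count, and batch size below are placeholders, not values from this issue):

```shell
# Before (OOM on a 6 GiB GPU): -o pre-occupies GPU memory at startup
# python tools/train.py -f exps/example/custom/yolox_s.py -d 1 -b 8 --fp16 -o

# After: drop -o so PyTorch allocates memory on demand
python tools/train.py -f exps/example/custom/yolox_s.py -d 1 -b 8 --fp16
```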

Try removing the -o option from the training command.

After removing -o I still hit this problem: RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 10.76 GiB total capacity; 9.62 GiB already allocated; 27.50 MiB free; 9.72 GiB reserved in total by PyTorch)

lyp-oss avatar Jul 27 '21 23:07 lyp-oss

Then you have to reduce your batch size or choose a smaller model like yolox-tiny or yolox-s.

GOATmessi7 avatar Jul 28 '21 00:07 GOATmessi7
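As a rough illustration of the batch-size advice above, the halving logic can be sketched as a small standalone helper. `per_sample_mib` is a hypothetical per-sample memory estimate you would measure empirically (e.g. by watching nvidia-smi during a short run); it is not something YOLOX reports.

```python
def suggest_batch_size(free_mem_mib: float, per_sample_mib: float, current_bs: int) -> int:
    """Halve the batch size until the rough memory estimate fits.

    free_mem_mib:    free GPU memory in MiB (e.g. read off nvidia-smi).
    per_sample_mib:  measured peak memory per training sample -- an empirical
                     assumption for this sketch, not a YOLOX API value.
    current_bs:      the batch size that currently OOMs.
    """
    bs = current_bs
    while bs > 1 and bs * per_sample_mib > free_mem_mib:
        bs //= 2  # halving keeps the search simple and converges in a few steps
    return max(bs, 1)
```

For example, with about 9700 MiB free and a measured ~1200 MiB per sample, a batch of 16 would be halved to 8. If even batch size 1 does not fit, switching to a smaller model such as yolox-tiny or yolox-s is the remaining option.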

Hi,

I didn't hit this problem during training, but when I tried to test the model after converting it with TensorRT, I ran into the same error. How can I solve it?

Thank you

LamnouarMohamed avatar Sep 13 '22 13:09 LamnouarMohamed