EfficientDet.Pytorch

6GB GPU memory, batch_size=1 with the D1 network, still got CUDA out of memory

AlexLuya opened this issue 4 years ago · 11 comments

Your default batch size is 32. What GPU did you use for training?

AlexLuya avatar Dec 14 '19 13:12 AlexLuya

Same here. A 2080TI (11GB) with batch_size = 1 still doesn't work. Here's the traceback:

Traceback (most recent call last):
  File "train.py", line 195, in <module>
    train()
  File "train.py", line 140, in train
    classification, regression, anchors = model(images)
  File "/home/ray/anaconda3/envs/dl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ray/EfficientDetPytorch/models/efficientdet.py", line 62, in forward
    anchors = self.anchors(inputs)
  File "/home/ray/anaconda3/envs/dl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ray/EfficientDetPytorch/models/module.py", line 153, in forward
    return torch.from_numpy(all_anchors.astype(np.float32)).cuda()
RuntimeError: CUDA error: out of memory

RayOnFire avatar Dec 15 '19 15:12 RayOnFire

You can try NVIDIA apex with opt_level='O2'. I got ~8100MB of GPU memory usage with batch size 16; you can use a smaller batch size to fit in 6GB of GPU RAM.

RayOnFire avatar Dec 15 '19 17:12 RayOnFire
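
For reference, a minimal sketch of how apex amp is wired into a training loop (the model, optimizer, dataloader, and compute_loss names below are placeholders, not this repo's code):

import torch
from apex import amp  # https://github.com/NVIDIA/apex

model = model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# opt_level='O2' keeps batchnorm in FP32 but casts most weights and
# activations to FP16, roughly halving activation memory vs. pure FP32.
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

for images, annotations in dataloader:
    classification, regression, anchors = model(images.cuda())
    loss = compute_loss(classification, regression, anchors, annotations)
    optimizer.zero_grad()
    # Scale the loss so FP16 gradients do not underflow, then backprop.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()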

Same problem. Two 2080TI (11GB*2) with batch_size = 6. Here's the traceback:

Traceback (most recent call last):
  File "C:/Users/Admin/Desktop/EfficientDet.Pytorch-master/train.py", line 196, in <module>
    train()
  File "C:/Users/Admin/Desktop/EfficientDet.Pytorch-master/train.py", line 141, in train
    classification, regression, anchors = model(images)
  File "C:\Users\Admin\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "C:\Users\Admin\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "C:\Users\Admin\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "C:\Users\Admin\Anaconda3\envs\pytorch\lib\site-packages\torch\_utils.py", line 385, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "C:\Users\Admin\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "C:\Users\Admin\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Desktop\EfficientDet.Pytorch-master\models\efficientdet.py", line 59, in forward
    features = self.BIFPN(features[-5:])
  File "C:\Users\Admin\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Desktop\EfficientDet.Pytorch-master\models\bifpn.py", line 109, in forward
    laterals = bifpn_module(laterals)
  File "C:\Users\Admin\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Desktop\EfficientDet.Pytorch-master\models\bifpn.py", line 196, in forward
    pathtd[i], scale_factor=2, mode='nearest')
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.00 GiB total capacity; 5.97 GiB already allocated; 678.40 KiB free; 38.08 MiB cached)

shengyuqing avatar Dec 16 '19 09:12 shengyuqing

@AlexLuya @RayOnFire @shengyuqing My setup: OS: Ubuntu 18.04, GPU: 2*2080TI (11GB). When training, I set batch_size 32 for EfficientDet-D0 (~20000MB CUDA) and batch_size 16 for EfficientDet-D1 (~20000MB CUDA). As of commit #36, if you use multi-GPU, I have changed .cuda() in the loss function and Anchor to .to(input.device). I think it will fix this issue.

toandaominh1997 avatar Dec 17 '19 03:12 toandaominh1997
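
For reference, a minimal sketch of what that change looks like in the Anchors module from the first traceback (compute_anchors is a hypothetical stand-in for the repo's numpy anchor-generation code):

import numpy as np
import torch
import torch.nn as nn

class Anchors(nn.Module):
    def forward(self, image):
        # Hypothetical placeholder for the repo's numpy-based anchor generation.
        all_anchors = self.compute_anchors(image.shape[2:])
        anchors = torch.from_numpy(all_anchors.astype(np.float32))
        # Before: .cuda() always allocates on the default GPU 0, so with
        # DataParallel every replica piles its anchors onto device 0.
        # After: .to(image.device) keeps each replica's anchors on its own GPU.
        return anchors.to(image.device)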

> As of commit #36, if you use multi-GPU, I have changed .cuda() in the loss function and Anchor to .to(input.device). I think it will fix this issue.

Thanks! I have updated the code, but still the same problem. Very strange.

shengyuqing avatar Dec 17 '19 09:12 shengyuqing

@toandaominh1997 I used Windows 10

shengyuqing avatar Dec 17 '19 09:12 shengyuqing

But I want to use D0-D7 with just one 2080Ti, batch_size >= 4 for any backbone, and an input shape of at least (448, 448) or (640, 640). It seems the backbone limits the input shape and needs more CUDA memory than the paper suggests; it isn't as lightweight or efficient as claimed.

foocker avatar Dec 24 '19 02:12 foocker
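
One way to check the actual memory cost of a given backbone and input shape is to measure peak CUDA memory over a single forward pass; a quick sketch (this helper is illustrative, not from the repo):

import torch

@torch.no_grad()
def peak_forward_memory_mb(model, input_shape=(1, 3, 640, 640)):
    """Peak CUDA memory (MB) for one forward pass at the given input shape."""
    model = model.cuda().eval()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()  # reset_peak_memory_stats() on newer PyTorch
    model(torch.randn(*input_shape, device='cuda'))
    return torch.cuda.max_memory_allocated() / 1024 ** 2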

> As of commit #36, if you use multi-GPU, I have changed .cuda() in the loss function and Anchor to .to(input.device). I think it will fix this issue.

I don't understand. Could you explain explicitly what was changed?

qtw1998 avatar Dec 29 '19 11:12 qtw1998

> @toandaominh1997 I used Windows 10

Have you solved the problem?

qtw1998 avatar Dec 29 '19 11:12 qtw1998

Have you solved the out-of-memory issue?

Jasper-Bai avatar Jan 06 '20 11:01 Jasper-Bai

I got the same problem on my Titan RTX.

yaoliUoA avatar Feb 21 '20 00:02 yaoliUoA