
CUDA out of memory.

Open peterlee909 opened this issue 5 years ago • 5 comments

/root/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:100: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Traceback (most recent call last):
  File "train.py", line 132, in <module>
    train(args, epoch, model, train_loader, device, optimizer, criterion)
  File "train.py", line 18, in train
    output = model(data)
  File "/root/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/data/AIModel/Robust-Lane-Detection/LaneDetectionCode/model.py", line 53, in forward
    x1 = self.inc(item)
  File "/root/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/data/AIModel/Robust-Lane-Detection/LaneDetectionCode/utils.py", line 32, in forward
    x = self.conv(x)
  File "/root/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/data/AIModel/Robust-Lane-Detection/LaneDetectionCode/utils.py", line 22, in forward
    x = self.conv(x)
  File "/root/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/root/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/root/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 4.98 GiB (GPU 0; 15.90 GiB total capacity; 7.36 GiB already allocated; 1.07 GiB free; 6.77 GiB cached)

I have a 16GB GPU, but I keep getting this error. I was wondering how you trained on two GPUs. By the way, I am using Pytorch 1.3.1. Thanks for your help!

peterlee909 avatar Dec 19 '19 09:12 peterlee909

The released code runs on PyTorch 0.4.0. If you adapt it to PyTorch 1.1.0 or above, optimizer.step() should be called before lr_scheduler.step(). For the out-of-memory problem, you may have to set a smaller batch size.

qinnzou avatar Dec 19 '19 12:12 qinnzou
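
The call order described in the reply above can be sketched as follows. This is a minimal illustration, not the repository's train.py: the model, the data, the StepLR scheduler, and all hyperparameters are hypothetical stand-ins chosen so the snippet runs on a CPU.

```python
import torch
from torch import nn

# Hypothetical stand-ins; the repository's real model, data and settings differ.
model = nn.Linear(4, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

lrs = []  # learning rate actually used in each epoch
for epoch in range(3):
    lrs.append(optimizer.param_groups[0]["lr"])
    for _ in range(2):  # two dummy batches per epoch
        data, target = torch.randn(8, 4), torch.randn(8, 2)
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()  # update the weights first
    scheduler.step()      # then advance the schedule, once per epoch

print(lrs)
```

With this order the first epoch trains at the initial learning rate, which is the value the warning says would otherwise be skipped.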

Thank you for your answer. I have set the batch size to 3, but it still didn't work. It really confuses me that a 16GB GPU runs out of memory. As far as I know, PyTorch won't use multiple GPUs by itself. Can you please tell me how you did that? Thank you so much!

peterlee909 avatar Dec 20 '19 01:12 peterlee909
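
The thread does not say how the author used two GPUs. One common way to spread a batch across several GPUs in PyTorch is torch.nn.DataParallel. The sketch below assumes that approach and uses a tiny hypothetical model, not the repository's network; it falls back to a single device when fewer than two GPUs are visible.

```python
import torch
from torch import nn

# Hypothetical toy model; the repository's network is much larger.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    # Assumption: DataParallel splits each batch along dim 0 across the GPUs.
    model = nn.DataParallel(model)
model = model.to(device)

batch = torch.randn(4, 3, 32, 64, device=device)  # dummy input batch
output = model(batch)
print(output.shape)
```

Because the batch is split along its first dimension, each GPU only holds the activations for its share of the samples.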

I found that I had forgotten to resize the images, which caused the memory problem. However, I now face another problem.

Traceback (most recent call last):
  File "c:/Users/10806337/Desktop/PortableGit/Projects/Robust-Lane-Detection/LaneDetectionCode/train.py", line 122, in <module>
    train(args, epoch, model, train_loader, device, optimizer, criterion)
  File "c:/Users/10806337/Desktop/PortableGit/Projects/Robust-Lane-Detection/LaneDetectionCode/train.py", line 21, in train
    loss.backward()
  File "C:\Users\10806337\.conda\envs\RobustLaneDetection\lib\site-packages\torch\tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\10806337\.conda\envs\RobustLaneDetection\lib\site-packages\torch\autograd\__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Do you have any idea what causes it?

peterlee909 avatar Dec 23 '19 02:12 peterlee909
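
On the resizing point above, a rough back-of-the-envelope estimate shows why image size matters so much: a dense float32 activation of shape (N, C, H, W) takes N × C × H × W × 4 bytes. The numbers below are assumptions for illustration only (batch size 3 as mentioned in the thread, a hypothetical 64-channel feature map, and hypothetical image sizes); they are not measured from the repository.

```python
def tensor_bytes(batch, channels, height, width, bytes_per_element=4):
    """Size in bytes of one dense (N, C, H, W) tensor; float32 uses 4 bytes."""
    return batch * channels * height * width * bytes_per_element

GIB = 1024 ** 3

# Assumed values: 64 channels and both image sizes are hypothetical.
full = tensor_bytes(batch=3, channels=64, height=720, width=1280)
small = tensor_bytes(batch=3, channels=64, height=128, width=256)

print(f"unresized: {full / GIB:.2f} GiB per feature map")
print(f"resized:   {small / GIB:.3f} GiB per feature map")
print(f"ratio:     {full / small:.3f}x")
```

A network keeps many such feature maps alive for the backward pass, so the total grows with the number of layers as well as with the image area.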

I have solved my problem: I had accidentally added torch.no_grad(): before the training loop. However, I found one bug in the code. In train(), loss = criterion(output, target) should be loss = criterion(output[0], target).

peterlee909 avatar Dec 23 '19 06:12 peterlee909
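
The cause described above is easy to reproduce in isolation. The sketch below uses a hypothetical one-layer model rather than the repository's code: a loss computed inside torch.no_grad() has no autograd graph, so calling backward() on it raises the RuntimeError from the traceback, while the same computation outside the context works.

```python
import torch
from torch import nn

# Hypothetical toy setup, not the repository's model or data.
model = nn.Linear(4, 1)
criterion = nn.MSELoss()
data, target = torch.randn(8, 4), torch.randn(8, 1)

# Inside no_grad, no graph is recorded, so the loss cannot be backpropagated.
with torch.no_grad():
    loss_no_grad = criterion(model(data), target)

try:
    loss_no_grad.backward()
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
print(error_message)

# Outside no_grad, the graph is recorded and backward() fills the gradients.
loss = criterion(model(data), target)
loss.backward()
print(model.weight.grad is not None)
```

torch.no_grad() is meant for evaluation or inference code, where gradients are not needed and skipping the graph saves memory.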

Did you get the results?

qinnzou avatar Jan 04 '20 07:01 qinnzou