
RuntimeError: CUDA error: out of memory

Open harsh-wardhan opened this issue 5 years ago • 9 comments

I'm trying to fine-tune ResNet-34 on the UCF101 dataset with this command:

python main.py --root_path ~/Downloads/data --video_path UCF101/UCF101_jpg \
    --annotation_path UCF101/ucfTrainTestlist/ucf101_01.json \
    --result_path UCF101/results --dataset ucf101 --n_classes 400 --n_finetune_classes 101 \
    --pretrain_path models/resnet-34-kinetics.pth --ft_begin_index 4 \
    --model resnet --model_depth 34 --resnet_shortcut A --batch_size 32 --n_threads 4 --checkpoint 5

Here's how my training starts

dataset loading [0/9537]
dataset loading [1000/9537]
dataset loading [2000/9537]
dataset loading [3000/9537]
dataset loading [4000/9537]
dataset loading [5000/9537]
dataset loading [6000/9537]
dataset loading [7000/9537]
dataset loading [8000/9537]
dataset loading [9000/9537]
dataset loading [0/3783]
dataset loading [1000/3783]
dataset loading [2000/3783]
dataset loading [3000/3783]
run train at epoch 1
Epoch: [1][1/297] Time 0.912 (0.912) Data 0.733 (0.733) Loss 4.7467 (4.7467) Acc 0.062 (0.062)
Epoch: [1][2/297] Time 0.480 (0.696) Data 0.327 (0.530) Loss 4.6803 (4.7135) Acc 0.062 (0.062)
....

But near the end of epoch 1 I get this error

Epoch: [1][296/297] Time 0.465 (0.463) Data 0.328 (0.327) Loss 1.5035 (2.0112) Acc 0.594 (0.503)
Traceback (most recent call last):
  File "main.py", line 137, in <module>
    train_logger, train_batch_logger)
  File "/home/testuser/harsh/3D-ResNets-PyTorch/train.py", line 29, in train_epoch
    outputs = model(inputs)#; print('Check 5: {} '.format(pytorch_total_params))
  File "/home/testuser/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/testuser/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/testuser/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/testuser/harsh/3D-ResNets-PyTorch/models/resnet.py", line 184, in forward
    x = self.layer4(x)
  File "/home/testuser/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/testuser/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/testuser/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/testuser/harsh/3D-ResNets-PyTorch/models/resnet.py", line 58, in forward
    out = self.conv2(out)
  File "/home/testuser/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/testuser/anaconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 421, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: out of memory

I am using a single NVIDIA GTX 1080 Ti GPU with 11 GB of memory. nvidia-smi clearly shows that memory utilization never exceeds 3 GB. I have tried resizing the input images to smaller sizes, different batch sizes, even different networks and training from scratch; in every case the same error appears right at the last iteration of the first epoch. I'm using PyTorch 0.4.1, and I'm able to train AlexNet on ImageNet in PyTorch on this GPU without any problem. Please let me know the cause of this problem and how to fix it.

harsh-wardhan avatar Oct 11 '18 08:10 harsh-wardhan
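
One pattern consistent with the symptoms above (a crash at the very end of the epoch despite low steady-state memory) is cuDNN benchmark mode: this repo's main.py sets torch.backends.cudnn.benchmark = True (worth double-checking in your copy), and when the dataset size is not divisible by the batch size, the final batch has a new input shape, so cuDNN benchmarks fresh convolution algorithms for it, which can transiently allocate a lot of memory. This is a hypothesis, not a confirmed diagnosis; below is a minimal sketch of the workaround using DataLoader's standard drop_last argument, with a dummy dataset standing in for the one main.py builds:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the UCF101 dataset; (C, T, H, W) = (3, 16, 112, 112)
# matches this repo's default clip size.
training_data = TensorDataset(torch.randn(100, 3, 16, 112, 112),
                              torch.zeros(100, dtype=torch.long))

# drop_last=True discards the final incomplete batch, so every batch has
# the same shape and cuDNN benchmark mode never re-tunes mid-training.
train_loader = DataLoader(training_data, batch_size=32, shuffle=True,
                          num_workers=4, pin_memory=True, drop_last=True)

# Alternatively, trade some speed for predictable memory use:
torch.backends.cudnn.benchmark = False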

I got the same problem on the same machine you describe above. I switched to another computer with 8 GPUs and it is working now.

ibsakum avatar Nov 26 '18 15:11 ibsakum

Ok. When I tried with a batch size of 8 it started working on the same machine. But the GPU memory utilization is less than 2 GB.

harsh-wardhan avatar Nov 29 '18 10:11 harsh-wardhan

> Ok. When I tried with a batch size of 8 it started working on the same machine. But the GPU memory utilization is less than 2 GB.

It is the same situation for me: it uses around 1.7 GB of memory, but a batch size of 16 gives me an out-of-memory error. May I know if you have figured out the reason? Thanks!

fnj1017 avatar Jan 25 '19 07:01 fnj1017

I met the same problem. With a batch size of 8 to 14, GPU memory usage is about 2000 MB, but when I increase the batch size to 15 it jumps to about 12000 MB. I think this is abnormal and I haven't figured out why.

ziqi-zhang avatar May 02 '19 13:05 ziqi-zhang
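
Two things worth separating when reading numbers like these: nvidia-smi polls the driver and can miss short-lived spikes, and PyTorch's caching allocator holds on to freed memory, so the two views rarely agree. A quick diagnostic sketch using PyTorch's own counters (available since 0.4):

import torch

x = torch.randn(15, 3, 16, 112, 112, device='cuda')  # mimic a batch of 15 clips

# memory_allocated reports what tensors currently occupy; max_memory_allocated
# reports the peak since the process started, catching transient spikes.
print('currently allocated: %.1f MB' % (torch.cuda.memory_allocated() / 1024 ** 2))
print('peak allocated:      %.1f MB' % (torch.cuda.max_memory_allocated() / 1024 ** 2))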

I'm having the same problem, but unlike the other comments in this thread, when I reduce the batch size I get this error instead:

IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

Does anyone know how to solve it? Thanks in advance.

I'm using the pretrained resnext-101 model.

masju96 avatar Jun 07 '19 11:06 masju96
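
That IndexError is usually a PyTorch migration issue rather than a memory one: from 0.4 onward a scalar loss is a 0-dimensional tensor, so indexing like loss.data[0] fails with exactly this message, and the fix is the .item() call the message suggests. A minimal runnable sketch (the losses.update(...) pattern mentioned afterward is modeled on this repo's train.py, so treat the exact call as an assumption):

import torch
import torch.nn.functional as F

outputs = torch.randn(8, 101)             # fake logits for 101 classes
targets = torch.zeros(8, dtype=torch.long)
loss = F.cross_entropy(outputs, targets)  # a 0-dim tensor on PyTorch >= 0.4

# loss.data[0] raises "invalid index of a 0-dim tensor" here;
# .item() is the supported way to pull out the Python number:
loss_value = loss.item()
print(loss_value)

In train.py this would mean replacing losses.update(loss.data[0], inputs.size(0)) with losses.update(loss.item(), inputs.size(0)).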

@masju96 I'm having this problem too with the resnext-101_32x4d model from the pytorch pretrainedmodels package. I'm trying out DataParallel to spread the load across my GPUs. I'll get back to you if mine works.

danielkurniadi avatar Jul 09 '19 05:07 danielkurniadi
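
For reference, torch.nn.DataParallel scatters each batch along dimension 0 across the visible GPUs and gathers the outputs back, so each GPU only holds batch_size / n_gpus clips. A minimal sketch with a stand-in layer instead of the full 3D ResNet (not this repo's exact wiring):

import torch
from torch import nn

# Stand-in for the 3D ResNet; shapes match this repo's first conv layer.
model = nn.Conv3d(3, 64, kernel_size=7, stride=(1, 2, 2), padding=3)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the module on every visible GPU
model = model.cuda()

clips = torch.randn(32, 3, 16, 112, 112).cuda()
out = model(clips)                  # each GPU processes 32 / n_gpus clips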

@masju96 @iqDF I have had this issue too; it happened after I manually upgraded PyTorch one day. Downgrading to an older version of PyTorch should solve it. I don't remember the exact version number, but anything from 2018 or earlier should work.

fnj1017 avatar Jul 09 '19 05:07 fnj1017

I think it's probably due to the num_workers setting in the DataLoader.

iuhiyuh avatar Jul 09 '19 09:07 iuhiyuh
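
That's cheap to test: num_workers=0 makes the DataLoader load every batch in the main process, taking worker processes out of the picture entirely (in this repo the value appears to come from the --n_threads flag). A sketch with a dummy dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(64, 3, 16, 112, 112),
                     torch.zeros(64, dtype=torch.long))

# num_workers=0: no subprocesses; if the error disappears, the workers
# (or pin_memory interacting with them) are worth investigating.
loader = DataLoader(data, batch_size=8, shuffle=True,
                    num_workers=0, pin_memory=True)

for clips, labels in loader:
    pass  # run one pass to see whether the error persists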

Issue #16417 has some useful information about this error. I'm running into the same error at the end of the first epoch (ActivityNet).

guilhermesurek avatar Jun 12 '20 21:06 guilhermesurek