3D-ResNets-PyTorch
RuntimeError: CUDA error: out of memory
I'm trying to fine-tune ResNet-34 on the UCF101 dataset with this command:
python main.py --root_path ~/Downloads/data --video_path UCF101/UCF101_jpg --annotation_path UCF101/ucfTrainTestlist/ucf101_01.json \
--result_path UCF101/results --dataset ucf101 --n_classes 400 --n_finetune_classes 101 \
--pretrain_path models/resnet-34-kinetics.pth --ft_begin_index 4 \
--model resnet --model_depth 34 --resnet_shortcut A --batch_size 32 --n_threads 4 --checkpoint 5
Here's how my training starts
dataset loading [0/9537]
dataset loading [1000/9537]
dataset loading [2000/9537]
dataset loading [3000/9537]
dataset loading [4000/9537]
dataset loading [5000/9537]
dataset loading [6000/9537]
dataset loading [7000/9537]
dataset loading [8000/9537]
dataset loading [9000/9537]
dataset loading [0/3783]
dataset loading [1000/3783]
dataset loading [2000/3783]
dataset loading [3000/3783]
run train at epoch 1
Epoch: [1][1/297] Time 0.912 (0.912) Data 0.733 (0.733) Loss 4.7467 (4.7467) Acc 0.062 (0.062)
Epoch: [1][2/297] Time 0.480 (0.696) Data 0.327 (0.530) Loss 4.6803 (4.7135) Acc 0.062 (0.062)
....
But near the end of epoch 1 I get this error
Epoch: [1][296/297] Time 0.465 (0.463) Data 0.328 (0.327) Loss 1.5035 (2.0112) Acc 0.594 (0.503)
Traceback (most recent call last):
  File "main.py", line 137, in
RuntimeError: CUDA error: out of memory
I am using a single NVIDIA GTX 1080 Ti GPU with 11 GB of memory. nvidia-smi clearly shows that memory utilization never exceeds 3 GB. I have tried resizing the input images to smaller sizes, different batch sizes, and even different networks and training from scratch; in every case the same error appears right at the last iteration of the first epoch. I'm using PyTorch version 0.4.1, and I'm able to train AlexNet on the ImageNet dataset in PyTorch on this GPU without any problem. Please let me know the reason for this problem and how to fix it.
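One thing worth checking (just a guess, since the crash shows up exactly where the first validation pass would begin): if the validation code still relies on `volatile=True` Variables, as many older PyTorch training loops did, that flag is a no-op from PyTorch 0.4 onward, so every validation batch keeps its autograd graph alive and GPU memory can run out even though nvidia-smi looks fine during training. A minimal sketch of a validation loop wrapped in `torch.no_grad()`; `val_loader`, `criterion`, and `device` are placeholders, not names from this repo:

```python
import torch

def validate(model, val_loader, criterion, device):
    """One validation pass that does not build autograd graphs."""
    model.eval()
    total_loss, n_batches = 0.0, 0
    # torch.no_grad() replaces the old volatile=True flag (a no-op since
    # PyTorch 0.4); without it every validation batch keeps its graph
    # alive and GPU memory grows until the allocator gives up.
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            total_loss += loss.item()
            n_batches += 1
    return total_loss / max(n_batches, 1)
```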
I got the same problem using the same machine you described above. I switched to another computer with 8 GPUs and it is working now.
Ok. When I tried with a batch size of 8 it started working on the same machine. But the GPU memory utilization is less than 2 GB.
It is the same situation for me: it uses around 1.7 GB of memory, but a batch size of 16 gives me an out-of-memory error. May I know if you have figured out the reason? Thanks!
I ran into the same problem. With a batch size of 8 to 14, GPU memory usage is about 2000 MB, but when I increase the batch size to 15 the memory jumps to 12000 MB. I think this is abnormal and haven't figured out why.
I'm having the same problem, but unlike the comments in this thread, when I reduce the batch size I get this error:
IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number
Does anyone know how to solve it? Thanks in advance.
I'm using the resnext-101 pretrained model.
@masju96 I'm having this problem too with the resnext-101_32x4d model from the pytorch pretrainedmodels package.
I'm trying out DataParallel to spread the load across my GPUs. I'll get back to you if it works; a sketch of the pattern I'm trying is below.
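For anyone curious, here is a minimal sketch of the usual `nn.DataParallel` pattern; the stand-in model and the input shape are only illustrative (in this repo the model would come from `model.py` and the clips are roughly 16 frames of 112x112):

```python
import torch
import torch.nn as nn

# Stand-in model; in this repo it would be the 3D ResNet/ResNeXt built in model.py.
model = nn.Conv3d(3, 8, kernel_size=3, padding=1)

if torch.cuda.is_available():
    model = model.cuda()
    # DataParallel splits each batch along dim 0 across the visible GPUs
    # and gathers the outputs back on the default device.
    model = nn.DataParallel(model)

# The wrapped model is called exactly like the original one.
clips = torch.randn(4, 3, 16, 112, 112)
if torch.cuda.is_available():
    clips = clips.cuda()
out = model(clips)
print(out.shape)
```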
@masju96 @iqDF I have had this issue too; it happened after I manually upgraded PyTorch one day. Downgrading to an older version of PyTorch solved it for me. I don't remember the exact version number, but anything from 2018 or earlier should work.
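Rather than downgrading, the error message itself points at the usual fix: newer PyTorch versions return 0-dimensional tensors for scalar losses, so old-style indexing such as `loss.data[0]` raises exactly this IndexError, and the replacement is `loss.item()`. A small self-contained sketch (the loss here is just a dummy example, not the loss used in this repo):

```python
import torch
import torch.nn.functional as F

# Any scalar loss is a 0-dim tensor in recent PyTorch versions.
loss = F.mse_loss(torch.randn(4), torch.randn(4))

# value = loss.data[0]   # old style: raises "invalid index of a 0-dim tensor"
value = loss.item()      # new style: extracts the Python float
print(value)
```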
I think it's probably due to the num_workers setting in the DataLoader.
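If it is worker-related, the knob to experiment with is the `num_workers` argument of `DataLoader`. A generic sketch with random tensors standing in for the repo's actual video dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy clips roughly shaped like 16-frame 112x112 inputs, with 101 fake labels.
dataset = TensorDataset(torch.randn(16, 3, 16, 112, 112),
                        torch.randint(0, 101, (16,)))

# num_workers controls how many subprocesses prefetch batches; each worker
# holds its own copy of the dataset object, so very large values mainly add
# host-RAM pressure. Setting it to 0 loads data in the main process.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)

for clips, labels in loader:
    pass  # a training step would go here
```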
There is some useful information about this error in issue #16417. I'm hitting the same error at the end of the first epoch (ActivityNet).