
train.lua error: cuda runtime error (2) : out of memory

Open fantine16 opened this issue 9 years ago • 7 comments

Hi all, I tried to train the network on MSCOCO. I downloaded the dataset and ran prepro.py, which produced cocotalk.h5 and cocotalk.json under the ./coco folder.

Then I tried to run the script:

$ th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json

And I get the following error message:

/home/liuchang/torch/install/bin/luajit: ./misc/optim_updates.lua:65: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-6827/cutorch/lib/THC/generic/THCStorage.cu:40
stack traceback:
    [C]: in function 'new'
    ./misc/optim_updates.lua:65: in function 'adam'
    /home/liuchang/neuraltalk2/train.lua:375: in main chunk
    [C]: in function 'dofile'
    ...hang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406620

My config:

  • os: Ubuntu 14.04, 64-bit
  • gpu: GeForce GTX 745, 4 GB
  • cuda: 7.0
  • cudnn: cudnn-7.0-linux-x64-v3.0-prod.tgz

I used ZeroBrane to step through train.lua, and the error occurred at line 387, when my GPU memory usage was at 99%.

I wonder if my GPU memory is too small to train on MSCOCO?
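(For reference, a quick way to confirm how close the card is to its limit is to watch GPU memory in a second terminal while train.lua runs. This isn't from the thread, just standard NVIDIA tooling; nvidia-smi ships with the driver.)

# poll GPU memory usage once per second while training runs
$ watch -n 1 nvidia-smi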

fantine16 commented on Jan 19, 2016

Hi there,

I encountered the same problem with a GTX 970M 6 GB GPU on the full MS COCO 2014 train dataset (about 400K images), but I trained successfully on the MS COCO 2014 validation dataset (about 200K images).

It seems 6 GB of GPU memory cannot handle 400K images, so I wonder: what hardware can run the MS COCO 2014 train dataset successfully?

BTW, neuraltalk2 is great!

Thanks

yuhai-china commented on Jan 22, 2016

Today I reinstalled cutorch, and the issue is fixed.

yuhai-china commented on Jan 23, 2016

Hi there,

I have the same issue when I increase the training batch size from 16 to 64. What can I do to cut down GPU memory usage? I suppose the CNN fine-tuning part consumes a huge amount of GPU memory.

Thanks!

LuoweiZhou commented on May 4, 2016

@LuoweiZhou you should try to decrease the batch size.
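(For example, assuming train.lua's -batch_size option, which I believe defaults to 16, keeping the batch size that already fits would look like the line below; treat the flag name as an assumption and check the opts in train.lua.)

# hypothetical example: train with the smaller batch size that fits in memory
$ th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -batch_size 16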

cuongduc commented on May 5, 2016

It works fine with 16, but is there any way to decrease the GPU memory cost? I noticed that the clones of the CNN and the language model use so much memory... Thanks!

LuoweiZhou commented on May 5, 2016

Well, the amount of memory consumed is determined by the number of parameters in the model (both the CNN and the LSTM). You can try a smaller CNN (e.g., AlexNet) or a smaller LSTM hidden size, etc. However, all of those tricks decrease performance.
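(As a rough sketch, assuming the option names in train.lua's opts, e.g. -rnn_size for the LSTM hidden size and -finetune_cnn_after for when CNN fine-tuning starts, halving the hidden size and leaving fine-tuning disabled might look like this; double-check the flag names against train.lua.)

# hypothetical example: smaller LSTM state, CNN fine-tuning left disabled (-1)
$ th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -rnn_size 256 -finetune_cnn_after -1

(Swapping in a smaller CNN such as AlexNet would additionally mean pointing the CNN prototxt/caffemodel options at the AlexNet files, again at some cost in caption quality.)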

cuongduc commented on May 5, 2016

Hi, I faced the same problem. I've downloaded MS COCO 2014 train and val. I have also downloaded the VGG_ILSVRC_16_layers.caffemodel. But when I run the command below, the out of memory error occurs:

th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -checkpoint_path checkpoints

The error is like this:

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-5309/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/linux/torch/install/bin/luajit: /home/linux/torch/install/share/lua/5.1/torch/File.lua:351: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-5309/cutorch/lib/THC/generic/THCStorage.cu:66

Any solutions? :(

P.S. When I run it in CPU mode by passing the parameter -gpuid -1, it starts training, but I think it will never finish!
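(Combining the command above with that -gpuid -1 workaround, the full CPU-only invocation would be something like the line below. As noted, CPU training on the full COCO train set is impractically slow, so on the GPU a smaller batch size, as suggested earlier in the thread, is the usual first thing to try.)

# CPU-only run: needs no CUDA memory, but very slow on the full train set
$ th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -checkpoint_path checkpoints -gpuid -1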

Mozhdeh-d commented on Oct 29, 2018