train.lua error: cuda runtime error (2): out of memory
Hi all, I tried to train the network on MSCOCO. I downloaded the dataset, then ran prepro.py, and cocotalk.h5 and cocotalk.json were created under the ./coco folder.
Then I tried to run the script:
$ th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json
And I got the following error message:
/home/liuchang/torch/install/bin/luajit: ./misc/optim_updates.lua:65: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-6827/cutorch/lib/THC/generic/THCStorage.cu:40
stack traceback:
  [C]: in function 'new'
  ./misc/optim_updates.lua:65: in function 'adam'
  /home/liuchang/neuraltalk2/train.lua:375: in main chunk
  [C]: in function 'dofile'
  ...hang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406620
My config:
- OS: Ubuntu 14.04, 64-bit
- GPU: GeForce GTX 745, 4 GB
- CUDA: 7.0
- cuDNN: cudnn-7.0-linux-x64-v3.0-prod.tgz
I used ZeroBrane to step through train.lua, and the error occurred at line 387, at which point my GPU memory was 99% occupied.
I wonder if my GPU memory is too small to train on MSCOCO?
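For anyone debugging the same thing, it may help to confirm how much memory cutorch actually sees before training starts. A minimal sketch from the Torch REPL (assuming cutorch is installed):

-- Print free vs. total memory on the current GPU.
require 'cutorch'
local dev = cutorch.getDevice()
local free, total = cutorch.getMemoryUsage(dev)  -- both values are in bytes
print(string.format('GPU %d: %.0f MB free of %.0f MB', dev, free / 2^20, total / 2^20))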
Hi there,
I encountered the same problem with a GTX 970M (6 GB) on the full MS COCO 2014 train dataset (about 400K images), but I trained successfully on the MS COCO 2014 validation dataset (about 200K images).
It seems 6 GB of GPU memory cannot handle the 400K training images, so I wonder what hardware can run the MS COCO 2014 train dataset successfully?
BTW, neuraltalk2 is really great!
Thanks
Today I reinstalled cutorch, and the issue is fixed.
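For anyone wanting to try the same fix, the reinstall is the usual luarocks route (rebuilding cunn as well is often suggested, since it compiles against cutorch):

$ luarocks install cutorch
$ luarocks install cunn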
Hi there,
I have the same issue when I increase the training batch size from 16 to 64. So, what can I do to cut down GPU memory usage? I suppose the CNN fine-tuning part consumes a huge amount of GPU memory.
Thanks!
@LuoweiZhou you should try to decrease the batch size
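For reference, train.lua exposes a -batch_size option (16 is the default, as noted above), so halving it should roughly halve the per-batch activation memory:

$ th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -batch_size 8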
It works fine with a batch size of 16, but is there any method to decrease the GPU memory cost? I noticed that the clones of the CNN and the language model use so much memory... Thanks!
Well, the amount of memory consumed is determined by the number of parameters in the model (including both the CNN and the LSTM). You can try a smaller CNN (e.g., AlexNet) or a smaller LSTM hidden size, etc. However, all of those tricks decrease performance.
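As a rough sketch of what that looks like on the command line (assuming train.lua's -rnn_size option, whose default I believe is 512, and -finetune_cnn_after -1 to keep the CNN frozen so its gradient buffers are never allocated):

$ th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -rnn_size 256 -finetune_cnn_after -1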
Hi, I faced the same problem. I've downloaded MS COCO 2014 train and val, and I have also downloaded VGG_ILSVRC_16_layers.caffemodel. But when I run the command below, the out-of-memory error occurs:
th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -checkpoint_path checkpoints
The error is like this:
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-5309/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/linux/torch/install/bin/luajit: /home/linux/torch/install/share/lua/5.1/torch/File.lua:351: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-5309/cutorch/lib/THC/generic/THCStorage.cu:66
Any solutions? :(
P.S. When I run it in CPU mode by passing the parameter -gpuid -1, it starts training, but I think it will never finish!