fast-neural-style
cuda runtime error (2): out of memory
When running this script:
th train.lua -h5_file /home/ubuntu/file.h5 -style_image /home/ubuntu/test.jpg -style_image_size 10 -content_weights 1.0 -style_weights 5.0 -checkpoint_name checkpoint -gpu 0 -num_iterations 500 -checkpoint_every 100 -use_cudnn 1
I get the following error:
Epoch 0.004735, Iteration 98 / 500, loss = 48087096.613525 0.001
Epoch 0.004784, Iteration 99 / 500, loss = 45139398.947998 0.001
Epoch 0.004832, Iteration 100 / 500, loss = 44173648.860840 0.001
Running on validation set ...
val loss = 43854673.763330
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-3427/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 13 module of nn.Sequential:
In 2 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/nn/CAddTable.lua:27: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-3427/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'resizeAs'
/home/ubuntu/torch/install/share/lua/5.1/nn/CAddTable.lua:27: in function 'updateGradInput'
/home/ubuntu/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
train.lua:211: in function 'opfunc'
/home/ubuntu/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
train.lua:239: in function 'main'
train.lua:327: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
train.lua:211: in function 'opfunc'
/home/ubuntu/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
train.lua:239: in function 'main'
train.lua:327: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
What might be causing this issue, and how much GPU memory do I need to train a model?
I've noticed that the validation/checkpointing routine seems to consume memory each time it runs and does not free it afterwards. Thus, each time it validates, it uses more and more memory. As you can see in your example, you were able to train the model up until the validation step; after validation, you ran up against your memory limit.
I added a cmd:option('-do_checkpoint', 1) option and an if/then wrap around the validation block so it can be turned off. You'll also need to add a model save at the end of the for-t iterations loop, otherwise your final model won't get saved if you turn off checkpointing. In #99 I posted the code snippet I put at the end of train.lua to save the final model.
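Roughly, the changes look like this. This is only a sketch: the real validation/checkpoint code in train.lua does more (e.g. preparing the model for saving), so the exact lines and placement will differ from my actual edit.

-- new command-line flag to toggle the validation/checkpointing block
cmd:option('-do_checkpoint', 1)

-- wrap the existing validation + checkpoint-saving code inside the training loop
if opt.do_checkpoint == 1 and t % opt.checkpoint_every == 0 then
  -- existing validation and checkpoint torch.save code stays here
end

-- after the "for t = 1, opt.num_iterations do ... end" loop, save the final model
-- so it isn't lost when checkpointing is turned off
model:clearState()  -- drop intermediate buffers before saving
torch.save(opt.checkpoint_name .. '.t7', model)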
One of my style images is large (600 x 800), and if I run without checkpointing, my memory usage stays consistent at about 6600 MB. If I run with 5000 iterations and -checkpoint_every 1000, then by the 5000th iteration my GPU usage will have crept up to 8100 MB, which is just about the limit of my GTX 1070.
As I'm just getting started with Torch and Lua, I haven't been able to figure out why the validation/checkpointing conditional causes train.lua to use more memory at each checkpoint without releasing it. It seems like all the memory associated with a checkpoint should be released once it has been performed, but that does not appear to be the case. Perhaps users wiser than I might have a solution to this.
If you use a smaller style_image_size, you'll require less memory for training.
I was able to fix it by decreasing -batch_size, which is set to 4 by default. Try adding the '-batch_size 3' flag.
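For example, the command from the original post becomes:

th train.lua -h5_file /home/ubuntu/file.h5 -style_image /home/ubuntu/test.jpg -style_image_size 10 -content_weights 1.0 -style_weights 5.0 -checkpoint_name checkpoint -gpu 0 -num_iterations 500 -checkpoint_every 100 -use_cudnn 1 -batch_size 3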
@alexbtlv That helped, thank you!
Out of memory can simply mean the GPU doesn't have enough memory, for example because batch_size is set too large. Another situation is that your code ran fine before, but after it finished or was stopped by the user, it no longer does. Sometimes another process is occupying too much GPU memory (nvidia-smi helps check this), and you can try killing that process. I found /usr/lib/xorg/Xorg occupying around 1800 MB of GPU memory, and after I killed it, training worked fine.
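For example (the PID is whatever nvidia-smi reports for the process holding GPU memory):

nvidia-smi        # lists processes using the GPU and how much memory each holds
sudo kill <PID>   # replace <PID> with the id of the offending process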
Has anyone figured this out? I can successfully train a model if -checkpoint_every is equal to -num_iterations, but the same command will run out of memory if I try to do more than one checkpoint.
@raustaburk Decrease the batch size as @alexbtlv suggested. Worked for me.
@g0t0wasd It works to a point, but it doesn't solve the core problem that checkpointing doesn't seem to free up memory.
I also run into this problem: train.lua runs out of memory as soon as a checkpoint is stored. I can, however, continue training from that checkpoint with -checkpoint_name c -resume_from_checkpoint c.t7.
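So as a workaround I restart training roughly like this (the placeholder stands for whatever training flags you normally pass):

th train.lua <usual training flags> -checkpoint_name c -resume_from_checkpoint c.t7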
Edit:
I was logged in via TeamViewer (remote control software) when I saw this error. Today I am sitting in front of the machine and the error is gone!