fast-neural-style

cuda runtime error (2): out of memory

Open g0t0wasd opened this issue 8 years ago • 8 comments

When I run this command:

th train.lua -h5_file /home/ubuntu/file.h5 -style_image /home/ubuntu/test.jpg -style_image_size 10 -content_weights 1.0 -style_weights 5.0 -checkpoint_name checkpoint -gpu 0 -num_iterations 500 -checkpoint_every 100 -use_cudnn 1

I get the following error:

Epoch 0.004735, Iteration 98 / 500, loss = 48087096.613525	0.001	
Epoch 0.004784, Iteration 99 / 500, loss = 45139398.947998	0.001	
Epoch 0.004832, Iteration 100 / 500, loss = 44173648.860840	0.001	
Running on validation set ... 	
val loss = 43854673.763330	
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-3427/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 13 module of nn.Sequential:
In 2 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/nn/CAddTable.lua:27: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-3427/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
	[C]: in function 'resizeAs'
	/home/ubuntu/torch/install/share/lua/5.1/nn/CAddTable.lua:27: in function 'updateGradInput'
	/home/ubuntu/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Module.lua:29>
	[C]: in function 'xpcall'
	/home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:78>
	[C]: in function 'xpcall'
	/home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
	train.lua:211: in function 'opfunc'
	/home/ubuntu/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
	train.lua:239: in function 'main'
	train.lua:327: in main chunk
	[C]: in function 'dofile'
	...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
	[C]: in function 'error'
	/home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
	/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
	train.lua:211: in function 'opfunc'
	/home/ubuntu/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
	train.lua:239: in function 'main'
	train.lua:327: in main chunk
	[C]: in function 'dofile'
	...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00405d50

What might cause this issue, and how much GPU memory do I need to train a model?

g0t0wasd commented on Feb 01 '17

I've noticed that the validation/checkpointing routine seems to consume additional memory each time it runs and does not release it afterwards, so each validation pass uses more and more memory. As you can see in your example, you were able to train the model up until the first validation; after validation, you ran up against your memory limit.

I added a cmd:option('-do_checkpoint', 1) option and an if/then wrapper around the validation so it can be turned off. You'll also need to add a model save at the end of the iteration for-loop, otherwise your final model won't get saved if you turn off checkpointing. In #99 I posted a code snippet I put at the end of train.lua to save the final model.
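
Roughly, the changes look like this (a sketch only; the exact variable names in train.lua may differ from what I remember):

-- new flag, 1 keeps the old behaviour
cmd:option('-do_checkpoint', 1)

-- inside the training loop, guard the existing validation/checkpoint code:
if opt.do_checkpoint == 1 and t % opt.checkpoint_every == 0 then
  -- original validation + torch.save(...) code, unchanged
end

-- after the loop, always save the final model so nothing is lost
-- when checkpointing is turned off:
model:clearState()
torch.save(opt.checkpoint_name .. '_final.t7', model)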

One of my style images is large (600 x 800), and if I run it without checkpointing, my memory usage stays consistent at about 6600 MB. If I run it with 5000 iterations and -checkpoint_every 1000, then by the 5000th iteration my GPU usage will have crept up to 8100 MB, which is just about the limit of my GTX 1070.

As I'm just getting started with Torch and Lua, I haven't yet been able to figure out why the validation/checkpointing conditional causes train.lua to use more memory at each checkpoint without releasing it. It seems like all memory associated with a checkpoint should be released once it is done, but that does not appear to be the case. Perhaps wiser users than me might have a solution to this.
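
If I had to guess at a workaround (untested, so treat it as an assumption): force a Lua garbage-collection pass right after the checkpoint block, since CUDA tensors created during validation are only freed once they are actually collected:

-- untested guess: run right after the checkpoint/validation block
collectgarbage()
collectgarbage()  -- a second pass helps finalize userdata such as CUDA tensors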

If you use a smaller style_image_size, you'll require less memory for training.

filmo commented on Feb 05 '17

I was able to fix it by decreasing -batch_size, which is set to 4 by default. Try adding the '-batch_size 3' flag.
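
For example, your original command with the batch size lowered (all other flags unchanged):

th train.lua -h5_file /home/ubuntu/file.h5 -style_image /home/ubuntu/test.jpg -style_image_size 10 -content_weights 1.0 -style_weights 5.0 -checkpoint_name checkpoint -gpu 0 -num_iterations 500 -checkpoint_every 100 -use_cudnn 1 -batch_size 3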

alexbtlv commented on Feb 06 '17

@alexbtlv That helped, thank you!

g0t0wasd commented on Feb 10 '17

Out of memory can be caused by the job genuinely needing too much memory, for example by setting a batch_size that is too large. Another situation is that the code ran fine before, but after it finished (or was stopped by the user) it no longer does. Sometimes another process is occupying too much GPU memory (nvidia-smi helps check this), and you can try killing that process. I found /usr/lib/xorg/Xorg occupying around 1800 MB of GPU memory; after I killed it, training worked fine.
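
For example (standard commands, nothing specific to this repo):

nvidia-smi          # lists the processes using the GPU and how much memory each one holds
sudo kill -9 <PID>  # replace <PID> with the offending process id, e.g. the Xorg one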

panfengli commented on Feb 26 '17

Has anyone figured this out? I can successfully train a model if -checkpoint_every is equal to -num_iterations, but the same command will run out of memory if I try to do more than one checkpoint.

raustaburk commented on Apr 08 '17

@raustaburk Decrease the batch size as @alexbtlv suggested. It worked for me.

g0t0wasd commented on Apr 08 '17

@g0t0wasd It works to a point, but it doesn't solve the core problem that checkpointing doesn't seem to free up memory.

raustaburk commented on Apr 09 '17

I also ran into the problem that train.lua runs out of memory as soon as a checkpoint is stored. I can, however, continue training from that checkpoint with -checkpoint_name c -resume_from_checkpoint c.t7. Edit: I was logged in via TeamViewer (remote control software) when I saw this error. Today I am sitting in front of the machine and the error is gone!

flaushi commented on Feb 15 '18