neuralconvo
The memory usage skyrockets each time it saves
And it doesn't free the memory. I executed this command:
```
th train.lua --cuda --dataset 50000 --hiddenSize 1000
```
On the first epoch it consumed 2 GiB of RAM, on the second 5 GiB, then 10 GiB, and by the 11th epoch my memory was full. (My computer has 32 GiB of RAM.)

This issue disappeared when I commented out lines 156 to 171 in train.lua (the RAM usage then stays at 1.2 GiB):
```lua
if minMeanError == nil or errors:mean() < minMeanError then
  print("\n(Saving model ...)")
  params, gradParams = nil, nil
  collectgarbage()
  -- Model is saved as CPU
  model:float()
  torch.save("data/model.t7", model)
  collectgarbage()
  if options.cuda then
    model:cuda()
  elseif options.opencl then
    model:cl()
  end
  collectgarbage()
  minMeanError = errors:mean()
end
```
So I conclude that the saving process may be the problem.
Seems to occur in the calls to model:float(). My workaround was to just save in GPU format:
```lua
if minMeanError == nil or errors:mean() < minMeanError then
  print("\n(Saving model ...)")
  params, gradParams = nil, nil
  collectgarbage()
  -- Workaround: save directly in GPU format, skipping model:float()
  torch.save("data/model.t7", model)
  collectgarbage()
  minMeanError = errors:mean()
end
```
I then added require 'cudnn' to the top of eval.lua in order to be able to load the saved model. If you want to save the model in CPU format, you could write a quick script to load the model, call model:float(), and save it again.
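A minimal sketch of such a conversion script (the file name convert_model.lua and the output path data/model-cpu.t7 are illustrative, not from this repo; you may also need to require whatever packages the model's layers come from):

```lua
-- convert_model.lua: load the GPU-format model and re-save it in CPU format.
-- Sketch only; the output path is an assumption.
require 'cutorch'
require 'cunn'
require 'cudnn'  -- needed to deserialize cudnn layers, as noted above

local model = torch.load("data/model.t7")
model:float()  -- convert all parameters to CPU FloatTensors
torch.save("data/model-cpu.t7", model)
```

Run it with th convert_model.lua, then use the CPU copy wherever a CPU-format model is needed.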
Thanks for your simple solution, @Namburgesas. Hope there's a fix in the future.
Did you try doing clearState() before calling model:float()? It clears the intermediary states in the model (they are not needed for prediction).
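For concreteness, a minimal sketch of that suggestion applied to the save block quoted above (clearState() is a standard nn method; everything else is unchanged):

```lua
if minMeanError == nil or errors:mean() < minMeanError then
  print("\n(Saving model ...)")
  params, gradParams = nil, nil
  collectgarbage()
  model:clearState()  -- drop intermediate output/gradInput buffers before saving
  -- Model is saved as CPU
  model:float()
  torch.save("data/model.t7", model)
  collectgarbage()
  if options.cuda then
    model:cuda()
  elseif options.opencl then
    model:cl()
  end
  collectgarbage()
  minMeanError = errors:mean()
end
```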