torch-rnn
Checkpoints not being written
th train.lua -input_h5 data.h5 -input_json data.json -rnn_size 2068 -dropout 0.3 -num_layers 3 -checkpoint_every 100
comes to a standstill when it's time to write a checkpoint. I also tried with -checkpoint_every 1000.
That's a pretty big RNN! It's a bit slow but runs fine on my system. Maybe you are running out of memory and swapping to disk? That would slow things down.
You are running that RNN on a GPU, right? What happens if you set -memory_benchmark to 1? BTW, what video card(s?) have you got?
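To try the memory benchmark, that would just be the original command with the flag appended:
th train.lua -input_h5 data.h5 -input_json data.json -rnn_size 2068 -dropout 0.3 -num_layers 3 -checkpoint_every 100 -memory_benchmark 1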
Justin, just curious, how big is the checkpoint file for this 3x2068? E.g. torch.save-ing a (not-reset/cleared, with all grads, etc.) 13.2GB NN takes 130-140 sec on my system, which does look like it's standing still.
The .t7 checkpoint is 659MB, and takes maybe 10 seconds to save on my system.
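For reference, a rough sketch of how one might time a checkpoint save outside of train.lua (this is not torch-rnn's exact saving code; it assumes model is your network and that your nn version provides clearState()):
require 'torch'
require 'nn'
local timer = torch.Timer()
model:clearState()                    -- drop cached activations/gradients so the file stays small
local checkpoint = {model = model:float()}  -- cast to float for a portable file; convert back before resuming training
torch.save('checkpoint_test.t7', checkpoint)
print(string.format('Saved checkpoint in %.1f seconds', timer:time().real))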
@rubinovitz What does [h]top say?
I'm having the same issue, running
th train.lua -input_h5 my_data.h5 -input_json my_data.json -model_type rnn -num_layers 3 -rnn_size 256
where my_data.h5 is about 500 MB. th hangs indefinitely when reaching the checkpoint. This is when running with -memory_benchmark 0. If I set it to 1 I get
Running with OpenCL on GPU 0
/home/kejace/torch/torch-cl/install/bin/luajit: train.lua:112: assertion failed!
stack traceback:
[C]: in function 'assert'
train.lua:112: in main chunk
[C]: in function 'dofile'
...h/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
Note that I'm using cltorch. htop tells me that it is still using the CPU:
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
25241 kejace 20 0 5571M 1134M 53308 S 106. 29.0 1:30.39 /home/kejace/torch/torch-c
25401 kejace 20 0 5571M 1136M 53344 R 90.2 29.0 1:21.29 /home/kejace/torch/torch-c
Having the same issue here, using cltorch.
Taking forever to save checkpoints. It takes about a minute to save a checkpoint for the tiny-shakespeare input and forever (I waited for about an hour) to save a checkpoint for a 48 MB input file. It is still doing something, since it's loading the CPU and GPU while trying to save that checkpoint:
gerbill@ubgerbill:~$ aticonfig --odgc --odgt
Default Adapter - Supported device 67B1
Core (MHz) Memory (MHz)
Current Clocks : 947 1250
Current Peak : 947 1250
Configurable Peak Range : [300-1500] [150-2000]
GPU load : 100%
Default Adapter - Supported device 67B1
Sensor 0: Temperature - 92.00 C
Previously I used an Amazon AWS GPU instance (NVIDIA GPU powered) and I haven't experienced anything like that; everything was working smoothly.
BTW, I am having the same issue as kejace when setting -memory_benchmark to 1.
I wonder if this is an OpenCL-related issue? @rubinovitz did you run this on NVIDIA or AMD?
I'm guessing it's this loop that is taking so long to process:
for j = 1, num_val do
local xv, yv = loader:nextBatch('val')
xv = xv:type(dtype)
yv = yv:type(dtype):view(N * T)
local scores = model:forward(xv):view(N * T, -1)
val_loss = val_loss + crit:forward(scores, yv)
end
If I set num_val = 2 the script no longer stalls and saves checkpoint files in about 10 seconds. I'm not sure what val_loss means at the end of the day. Will this value be needed for text sampling later? Or is it output just to inform me of some average loss over the previous iterations?
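For what it's worth, a hedged sketch of capping the validation pass instead of hard-coding num_val = 2 (assuming loader is the stock DataLoader with a split_sizes table, and dtype, model, crit, N, T are defined as in train.lua):
local max_val_batches = 10  -- hypothetical cap; raise it for a more reliable estimate
local num_val = math.min(loader.split_sizes['val'], max_val_batches)
local val_loss = 0
for j = 1, num_val do
  local xv, yv = loader:nextBatch('val')
  xv = xv:type(dtype)
  yv = yv:type(dtype):view(N * T)
  local scores = model:forward(xv):view(N * T, -1)
  val_loss = val_loss + crit:forward(scores, yv)
end
val_loss = val_loss / num_val  -- average loss over the batches actually evaluated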
You don't need the validation loss for sampling, but comparing validation loss with training loss is a good way to check whether your model is overfitting. To make this part go faster you can use a smaller validation set (by setting a smaller value for --val_frac in the preprocess.py script).
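For example, something along these lines (assuming the --val_frac and --test_frac flags of preprocess.py; adjust the fractions and paths to your data):
python scripts/preprocess.py --input_txt my_data.txt --output_h5 my_data.h5 --output_json my_data.json --val_frac 0.01 --test_frac 0.01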