
Checkpoints not being written

Open rubinovitz opened this issue 8 years ago • 10 comments

th train.lua -input_h5 data.h5 -input_json data.json -rnn_size 2068 -dropout 0.3 -num_layers 3 -checkpoint_every 100

comes to a standstill when it's time to write a checkpoint. I also tried -checkpoint_every 1000.

rubinovitz avatar Apr 07 '16 21:04 rubinovitz

That's a pretty big RNN! It's a bit slow but runs fine on my system. Maybe you are running out of memory and swapping to disk? That would slow things down.

jcjohnson avatar Apr 07 '16 21:04 jcjohnson

You are running that RNN on a GPU, right? What happens if you set -memory_benchmark to 1? BTW, what video card(s?) have you got?

AlekzNet avatar Apr 07 '16 22:04 AlekzNet
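
For anyone wanting to rule out GPU memory pressure before blaming the checkpoint write itself, a minimal cutorch sketch (this assumes the CUDA backend; cltorch has no equivalent call) is:

    -- Print free vs. total GPU memory on the current device (CUDA only).
    require 'cutorch'

    local dev = cutorch.getDevice()
    local free_bytes, total_bytes = cutorch.getMemoryUsage(dev)
    print(string.format('GPU %d: %.0f MB free of %.0f MB',
                        dev, free_bytes / 2^20, total_bytes / 2^20))

If the free figure is near zero right before a checkpoint, the slowdown is more likely allocation pressure and host swapping than the torch.save call itself.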

Justin, just curious, how big is the checkpoint file for this 3x2068? E.g. torch.saving a (not-reset/cleared, with all grads, etc.) 13.2GB NN takes 130-140 sec on my system, which can look like a standstill.

AlekzNet avatar Apr 07 '16 22:04 AlekzNet

The .t7 checkpoint is 659MB, and takes maybe 10 seconds to save on my system.


jcjohnson avatar Apr 07 '16 22:04 jcjohnson
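
As a sanity check on that 659MB figure, a rough parameter count for a 3-layer, 2068-unit LSTM lines up once gradient buffers are included. This is a back-of-the-envelope sketch under assumed defaults (a (D+H) x 4H weight matrix plus a 4H bias per layer, wordvec_size D = 64, float32 storage, embedding and output layers ignored), not the exact model definition:

    -- Rough checkpoint-size estimate for -num_layers 3 -rnn_size 2068.
    local H, D, num_layers = 2068, 64, 3

    local params = 0
    for layer = 1, num_layers do
      local input_dim = (layer == 1) and D or H
      params = params + (input_dim + H) * 4 * H  -- gate weights
      params = params + 4 * H                    -- gate biases
    end

    local weight_mb = params * 4 / 2^20      -- float32 weights only
    local with_grads_mb = weight_mb * 2      -- nn modules also serialize gradWeight buffers
    print(string.format('~%.0fM params, ~%.0f MB weights, ~%.0f MB with gradients',
                        params / 1e6, weight_mb, with_grads_mb))

That comes out to roughly 86M parameters, ~330 MB of weights, and ~660 MB once the gradient buffers stored alongside them are counted, which is close to the reported checkpoint size and also fits AlekzNet's point above about a not-reset/cleared network being slower to save.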

@rubinovitz What does [h]top say?

AlekzNet avatar Apr 07 '16 22:04 AlekzNet

I'm having the same issue, running

th train.lua -input_h5 my_data.h5 -input_json my_data.json -model_type rnn -num_layers 3 -rnn_size 256

where my_data.h5 is about 500 MB. th hangs indefinitely when it reaches the checkpoint. This is when running with -memory_benchmark 0; if I set it to 1, I get:

Running with OpenCL on GPU 0
/home/kejace/torch/torch-cl/install/bin/luajit: train.lua:112: assertion failed!
stack traceback:
        [C]: in function 'assert'
        train.lua:112: in main chunk
        [C]: in function 'dofile'
        ...h/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

Note that I'm using cltorch.

htop tells me that it is still using CPU:

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
25241 kejace     20   0 5571M 1134M 53308 S 106. 29.0  1:30.39 /home/kejace/torch/torch-c
25401 kejace     20   0 5571M 1136M 53344 R 90.2 29.0  1:21.29 /home/kejace/torch/torch-c

kejace avatar Jun 12 '16 04:06 kejace
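
The -memory_benchmark failure is consistent with that code path being CUDA-only: it reads GPU memory statistics through cutorch, which is never loaded under cltorch, so an assertion on the CUDA backend would fail immediately. The exact line isn't quoted in this thread, but the guard is presumably something of this shape (a paraphrase for illustration, not the actual train.lua source):

    -- Hypothetical paraphrase of a CUDA-only guard around the memory benchmark.
    if opt.memory_benchmark == 1 then
      -- cutorch.getMemoryUsage is a CUDA API with no cltorch equivalent,
      -- so the benchmark can only run (and is only asserted) in CUDA mode.
      assert(cutorch ~= nil)
    end

If that is the case, the assertion failure under OpenCL is expected and is separate from the slow-checkpoint problem.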

Having the same issue here, using cltorch. Checkpoints take forever to save: about a minute for the tiny-shakespeare input and seemingly forever (I waited about an hour) for a 48 MB input file. It is still doing something, since it's loading the CPU and GPU while trying to save that checkpoint:

gerbill@ubgerbill:~$ aticonfig --odgc --odgt

Default Adapter - Supported device 67B1
                            Core (MHz)    Memory (MHz)
           Current Clocks :    947           1250
             Current Peak :    947           1250
  Configurable Peak Range : [300-1500]     [150-2000]
                 GPU load :    100%

Default Adapter - Supported device 67B1
                  Sensor 0: Temperature - 92.00 C

I previously used an Amazon AWS GPU instance (NVIDIA powered) and never experienced anything like this; everything worked smoothly.

BTW, I am having the same issue as kejace when setting -memory_benchmark to 1.

gerbill avatar Jun 23 '16 17:06 gerbill

I wonder if this is an OpenCL-related issue. @rubinovitz, did you run this on NVIDIA or AMD?

kejace avatar Jun 23 '16 17:06 kejace

I'm guessing it's this loop that takes so long to process:

    for j = 1, num_val do
      local xv, yv = loader:nextBatch('val')
      xv = xv:type(dtype)
      yv = yv:type(dtype):view(N * T)
      local scores = model:forward(xv):view(N * T, -1)
      val_loss = val_loss + crit:forward(scores, yv)
    end

If I set num_val = 2, the script no longer stalls and saves checkpoint files in about 10 seconds. I'm not sure what val_loss means at the end of the day. Will this value be needed for text sampling later, or is it output just to inform me of some average loss over the previous iterations?

gerbill avatar Jun 23 '16 18:06 gerbill

You don't need the validation loss for sampling, but comparing validation loss with training loss is a good way to check whether your model is overfitting. To make this part go faster you can use a smaller validation set (by setting a smaller value for --val_frac in the preprocess.py script).

jcjohnson avatar Jun 23 '16 18:06 jcjohnson
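
For completeness: the same effect as a smaller validation set can be approximated in code by capping how many validation batches are evaluated at checkpoint time, which is essentially what the num_val = 2 experiment above did. A minimal sketch (max_val_batches is a made-up knob for illustration, not an existing train.lua option):

    -- Cap the number of validation batches evaluated when checkpointing.
    local max_val_batches = 10  -- hypothetical limit
    local num_val_capped = math.min(num_val, max_val_batches)

    local val_loss = 0
    for j = 1, num_val_capped do
      local xv, yv = loader:nextBatch('val')
      xv = xv:type(dtype)
      yv = yv:type(dtype):view(N * T)
      local scores = model:forward(xv):view(N * T, -1)
      val_loss = val_loss + crit:forward(scores, yv)
    end
    val_loss = val_loss / num_val_capped  -- average loss over the evaluated batches

Shrinking --val_frac at preprocessing time is the cleaner fix, since val_loss then still averages over the entire (smaller) validation split rather than over whichever batches happen to come first.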