char-rnn

Code for init_from in training has a few bugs

Open udibr opened this issue 10 years ago • 6 comments

In line https://github.com/karpathy/char-rnn/blob/master/train.lua#L127 there is a Lua syntax mistake

The checkpoint vocab may be smaller than the input vocab and still pass the test.

The checkpoint may contain a model that is not an LSTM.
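To illustrate the vocab point, here is a minimal sketch of a stricter comparison. It is not the code from train.lua: `vocabs_match` and the sample tables are hypothetical, and it assumes both vocabularies are plain Lua tables mapping characters to integer indices. A one-directional loop over the checkpoint vocab cannot notice extra characters in the new dataset, so the sketch checks both directions. It also uses `~=`, because in Lua `not x == i` is parsed as `(not x) == i`.

```lua
-- Hypothetical helper, not part of char-rnn: returns true only when both
-- tables map exactly the same characters to exactly the same indices.
local function vocabs_match(checkpoint_vocab, data_vocab)
    -- every checkpoint character must exist in the dataset with the same index
    for c, i in pairs(checkpoint_vocab) do
        -- `not data_vocab[c] == i` would parse as `(not data_vocab[c]) == i`
        if data_vocab[c] ~= i then
            return false
        end
    end
    -- the dataset must not contain characters the checkpoint has never seen
    for c, _ in pairs(data_vocab) do
        if checkpoint_vocab[c] == nil then
            return false
        end
    end
    return true
end

-- made-up sample vocabularies for illustration
local old = { a = 1, b = 2 }
local new = { a = 1, b = 2, c = 3 }
print(vocabs_match(old, old))  -- true
print(vocabs_match(old, new))  -- false: `c` is missing from the checkpoint vocab
```

With a check like this, a checkpoint vocab that is a strict subset of the dataset vocab is rejected instead of slipping through.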

udibr avatar Nov 24 '15 23:11 udibr

Thanks, this looks like what was causing my "bug" (which was quite discouraging), where the training loss kept going to infinity, even with new data. I'll retest and see if it helps my case.

wrapperband avatar Nov 25 '15 11:11 wrapperband

If you are using a different input.txt when re-running the training code with an old model (using the -init_from flag), then you need more changes to the code. I have support for this on my develop branch: https://github.com/udibr/char-rnn/tree/develop

Also, if the new input.txt file is in the same directory as the old one, you should delete data.t7 and vocab.t7.
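As a rough sketch of that cleanup step in plain Lua: `clear_cache` is a hypothetical helper, not something char-rnn provides, and it assumes the cached files sit directly inside the data directory under the names data.t7 and vocab.t7.

```lua
-- Hypothetical helper: delete cached preprocessing files from a data directory
-- and return the list of paths that were actually removed.
local function clear_cache(data_dir)
    local removed = {}
    for _, name in ipairs({ 'data.t7', 'vocab.t7' }) do
        local path = data_dir .. '/' .. name
        -- os.remove returns nil plus an error message when the file is missing
        local ok = os.remove(path)
        if ok then
            removed[#removed + 1] = path
        end
    end
    return removed
end

-- Example call (the directory name is only an illustration):
-- clear_cache('data/my_dataset')
```

Missing files are skipped silently, so the helper is safe to call before every run that uses a changed input.txt.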

udibr avatar Nov 25 '15 13:11 udibr

Thanks for the update. It sounds like the new inputs also caused some of my problems. But I always created a new data.t7 and vocab.t7 for updated input, so that wasn't the cause.

wrapperband avatar Nov 25 '15 13:11 wrapperband

I'm setting up a new PC (R9 290, 15.10), which might be causing other problems.
Bearing that in mind, I tested udibr's development version and got the same or similar errors. With either char-rnn version I can't start a new net.

I had 2 other errors, one about "?" (in a diamond) characters when creating data.t7 and vocab.t7.
I had trouble capturing those because the graphics crash and you lose the window top menu bars (KUbuntu).
I have reinstalled torch etc. a couple of times, but will do a complete reinstall next if there are no other ideas. I have already swapped checkpoints between an R9 270 and an HD 6970, so moving to the R9 290 should be OK. I've done a couple of driver reinstalls.

th train.lua -data_dir ~/programs/char-rnn/data/songster11 -opencl 1 -gpuid 0 -init_from cv/Songster3-0-02.t7 -dropout .5 -seed 97 -eval_val_every 1200 -savefile 'Songster4-1-6.95-286' -max_epochs 1 -train_frac 0.95 -val_frac 0.05

th train.lua -data_dir ~/programs/char-rnn/data/songster11 -opencl 1 -seq_length 180 -rnn_size 700 -num_layers 4 -max_epochs 50 -savefile 'Songster4-0.94' -eval_val_every 2000 -train_frac 0.945 -val_frac 0.05

user@marvin-songster:~/programs/char-rnn$ ./songster.sh
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 1395, val: 74, test: 0
vocab size: 114
loading a model from checkpoint cv/Songster3-0-02.t7
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Hawaii
checkpoint_vocab_size: 113
/home/user/torch/install/bin/luajit: train.lua:137: error, the character vocabulary for this dataset and the one in the saved checkpoint are not the same. This is trouble.
stack traceback:
    [C]: in function 'assert'
    train.lua:137: in main chunk
    [C]: in function 'dofile'
    ...user/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405d70

user@marvin-songster:~/programs/char-rnn$ ./starter.sh
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 385, val: 20, test: 3
vocab size: 114
creating an lstm with 4 layers
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Hawaii
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
setting forget gate biases to 1 in LSTM layer 3
setting forget gate biases to 1 in LSTM layer 4
number of parameters in the model: 14141514
cloning rnn
cloning criterion
/home/user/torch/install/bin/luajit: /home/user/torch/install/share/lua/5.1/nn/CAddTable.lua:21: Error: copyTo failed with -4 at /tmp/luarocks_cltorch-scm-1-458/cltorch/cltorch/src/lib/THClTensorCopy.cpp:162
stack traceback:
    [C]: in function 'copy'
    /home/user/torch/install/share/lua/5.1/nn/CAddTable.lua:21: in function 'updateGradInput'
    /home/user/torch/install/share/lua/5.1/nngraph/gmodule.lua:327: in function 'neteval'
    /home/user/torch/install/share/lua/5.1/nngraph/gmodule.lua:361: in function 'updateGradInput'
    /home/user/torch/install/share/lua/5.1/nn/Module.lua:30: in function 'backward'
    train.lua:284: in function 'opfunc'
    /home/user/torch/install/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
    train.lua:314: in main chunk
    [C]: in function 'dofile'
    ...user/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405d70

wrapperband avatar Nov 30 '15 17:11 wrapperband

Here's the error message when restarting from a checkpoint using the -init_from flag with the updated version:

user@marvin-songster:~/programs/char-rnn1$ ./songster.sh
using OpenCL on GPU 0...
loading a model from checkpoint cv/lm_Songster4-1-6.95-286_epoch1.00_1.6287.t7
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Hawaii
overwriting rnn_size=700, num_layers=4, model=lstm based on the checkpoint.
vocab.t7 and data.t7 do not exist. Running preprocessing...
one-time setup: preprocessing input text file /home/user/programs/char-rnn1/data/songster11/input.txt...
loading text file...
creating vocabulary mapping...
putting data into tensor...
/home/user/torch/install/bin/luajit: ./util/CharSplitLMMinibatchLoader.lua:171: char "� not in dictionary
stack traceback:
    [C]: in function 'assert'
    ./util/CharSplitLMMinibatchLoader.lua:171: in function 'text_to_tensor'
    ./util/CharSplitLMMinibatchLoader.lua:38: in function 'create'
    train.lua:141: in main chunk
    [C]: in function 'dofile'
    ...user/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405d70
user@marvin-songster:~/programs/char-rnn1$

wrapperband avatar Nov 30 '15 19:11 wrapperband

Retraining also has this bug -> https://github.com/karpathy/char-rnn/issues/137

Atcold avatar Dec 04 '15 21:12 Atcold