char-rnn
Cannot Train: attempt to call field 'ClassNLLCriterion_updateOutput' (a nil value)
Training used to work on my computer, though I hadn't tried it in several months. I've occasionally done some sampling, but no training since July (largely because my CPU was overheating, and CUDA is not a feasible option sadly).
Any attempt at training invariably gives me the following error in the nn library. Reinstalling Torch and the nngraph/optim/nn packages has not helped. I redownloaded the latest version of char-rnn into a new folder and that didn't help either. Any ideas on what I need to do to get this working?
It looks like the nn package isn't being initialised correctly in torch, but I don't know why/how.
The only thing Google has turned up so far is https://github.com/torch/nn/issues/122 and that isn't at all applicable. I can't even install cutorch/cunn on my system, as my hardware cannot support it. Maybe I should just get a new computer :P
ed@ed:~/char-rnn-master$ th train.lua -gpuid -1 -data_dir data/fora -rnn_size 512 -num_layers 2 -dropout 0.5
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 1414, val: 75, test: 0
vocab size: 187
creating an lstm with 2 layers
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
number of parameters in the model: 3632827
cloning rnn
cloning criterion
/home/ed/torch/install/bin/luajit: .../ed/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:44: attempt to call field 'ClassNLLCriterion_updateOutput' (a nil value)
stack traceback:
.../ed/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:44: in function 'forward'
train.lua:274: in function 'opfunc'
/home/ed/torch/install/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
train.lua:314: in main chunk
[C]: in function 'dofile'
...e/ed/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
I see you're not using cutorch/cunn. When it was working, were you using OpenCL (cltorch/clnn) or CPU (torch/nn generic)?
torch/nn generic seems to define ClassNLLCriterion_updateOutput here: https://github.com/torch/nn/blob/master/generic/ClassNLLCriterion.c and cunn defines it here: https://github.com/torch/cunn/blob/master/ClassNLLCriterion.cu but I can't find where clnn defines it. (It ought to be in https://github.com/hughperkins/clnn somewhere: https://github.com/hughperkins/clnn/search?q=ClassNLLCriterion_updateOutput )
If that were the only issue, a simple way to test it might be to try training a new model by invoking train.lua with -opencl 0 and -gpuid -1 (telling it to use the CPU). I know your CPU overheats, but the test would just be to quickly find out if it gets past this error.
I had it working with CPU (-gpuid -1) before.
FYI, using the Python gist doesn't (quite) overheat because it only uses one core.
Does it still work with -gpuid -1 now? (That's the quick test I was asking you to perform.)
If yes, that implies you're trying to use OpenCL, but can't. Then I think the problem should be submitted at github.com/hughperkins/clnn, asking specifically why there is no "ClassNLLCriterion_updateOutput" anywhere in that repository.
If no, that implies you'd be happy with anything, even CPU, but that's broken too. Then it's an issue for the torch/nn project, over at github.com/torch/nn and the question to ask is "why does generic/ClassNLLCriterion.c define ClassNLLCriterion_updateOutput but ClassNLLCriterion.lua does not see the definition?"
(And yes, I like the Python/numpy solution for similar reasons. I have a big desktop machine where I can have a few models training at once, and the computer is still extremely usable, stays cool, and the fan doesn't even run much. You can get my version at mrob.com/pub/comp/min-char-rnn.py.txt and read the block-comment for setup instructions. Same license as the original.)
I don't quite follow. In my original post, it shows that I used -gpuid -1. What do you want me to change?
Oh, I see now: I thought it was just a bunch of error messages. Yes, "th train.lua -gpuid -1" is right there. Sorry! :neutral_face:
Okay, that means it's an issue for the torch/nn project.
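For anyone else hitting this message, here is a minimal pure-Lua sketch of how this class of error arises. It does not use Torch and is not the actual torch/nn code: the `backend` table and the `forward` function are hypothetical stand-ins for "a Lua wrapper that expects a compiled function to have been registered under a given name". If the name was never registered, the lookup returns nil and the call fails, which is consistent with the "nn isn't being initialised correctly" guess above.

```lua
-- Hypothetical stand-in for the table of compiled functions that a Lua
-- wrapper expects the native side to have filled in. Not torch/nn's real code.
local backend = {}

-- Simulates a wrapper method that forwards to the compiled routine by name.
local function forward(input, target)
  return backend.ClassNLLCriterion_updateOutput(input, target)
end

-- Toy input: log-probabilities for two classes, with class 2 as the target.
local logprobs = { math.log(0.1), math.log(0.9) }

-- Case 1: the function was never registered, so the field is nil and the
-- call raises an "attempt to call ... nil value" error naming the field.
local ok, err = pcall(forward, logprobs, 2)
print(ok, err)

-- Case 2: once a function is registered under that name, the same call works.
backend.ClassNLLCriterion_updateOutput = function(input, target)
  return -input[target]  -- toy negative log-likelihood of the target class
end
local ok2, loss = pcall(forward, logprobs, 2)
print(ok2, loss)
```

The exact wording of the error varies between Lua versions, but it names the missing field in each case. So the message in the traceback says only that the Lua file could not find the function at call time, not that the C source lacks it. That is why a mismatch between the installed Lua files and the compiled library is worth ruling out.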