neuralconvo
OpenCL training fail
I have never been successful at training.
th train.lua --opencl --dataset 50000 --hiddenSize 1000
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...
Dataset stats:
Vocabulary size: 25931
Examples: 83632
libthclnn_searchpath /Users/SolarKing/Dev/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: GeForce 9400M
-- Epoch 1 / 50
/Users/SolarKing/Dev/torch/install/bin/luajit: ...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:67: In 1 module of nn.Sequential: bad argument #3 to '?' (number expected, got nil)
stack traceback:
[C]: at 0x0ebe4500
[C]: in function '__newindex'
.../Dev/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function <.../Dev/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:99>
[C]: in function 'xpcall'
...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...arKing/Dev/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./seq2seq.lua:71: in function 'train'
train.lua:85: in main chunk
[C]: in function 'dofile'
.../Dev/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x010e8bbbb0
WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
stack traceback:
[C]: in function 'error'
...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
...arKing/Dev/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./seq2seq.lua:71: in function 'train'
train.lua:85: in main chunk
[C]: in function 'dofile'
.../Dev/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x010e8bbbb0
I'm also using torch-cl. Following the tutorial there, you shouldn't install nn, cudnn, clnn, etc., because they break the installation. The only things I installed were rnn and penlight.
Got something similar:
lerk@blrg:~/workspace/neuralconvo$ th train.lua --opencl
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...
Dataset stats:
Vocabulary size: 35147
Examples: 221282
libthclnn_searchpath /home/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce GTX 660
-- Epoch 1 / 50
/home/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
[C]: at 0x7f142d00baa0
[C]: in function '__newindex'
...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
/home/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./seq2seq.lua:66: in function 'train'
train.lua:88: in main chunk
[C]: in function 'dofile'
...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405e90
UPDATE: I also tried this on my MacBook Pro, same error:
lerk@blackreach ~/workspace/neuralconvo [14:33:45]
> $ th train.lua --opencl [±master ✓]
-- Loading dataset
data/vocab.t7 not found
-- Parsing Cornell movie dialogs data set ...
[=============================================================== 387810/387810 =======>] Tot: 1s615ms | Step: 0ms
-- Pre-processing data
[================================================================ 166194/166194 ======>] Tot: 31s885ms | Step: 0ms
-- Removing low frequency words
[================================================================ 221282/221282 ======>] Tot: 14s809ms | Step: 0ms
Writing data/examples.t7 ...
[=============================================================== 221282/221282 =======>] Tot: 33s43ms | Step: 0ms
Writing data/vocab.t7 ...
Dataset stats:
Vocabulary size: 35147
Examples: 221282
libthclnn_searchpath /Users/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: ATI Radeon HD 6770M
-- Epoch 1 / 50
/Users/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
[C]: at 0x05350f40
[C]: in function '__newindex'
...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
...rs/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./seq2seq.lua:66: in function 'train'
train.lua:88: in main chunk
[C]: in function 'dofile'
...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x0105008d00
The stack trace suggests that the error is on this line:
local encoderOutput = self.encoder:forward(encoderInputs)
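For context, a minimal sketch of what happens at that call (the vocab and encoder names here are assumptions, not the project's actual variables): the encoder is fed a tensor of word indices, and if any lookup yields nil (e.g. an out-of-vocabulary word), clnn's LookupTable can end up writing a nil where it expects a number, matching the "number expected, got nil" error above.

```lua
-- Hedged sketch: build the index tensor that encoder:forward() expects.
-- 'vocab' and 'encoder' are hypothetical stand-ins for the real objects.
local ids = {}
for i, word in ipairs({"hello", "world"}) do
  local id = vocab[word]  -- may be nil if the word is not in the vocabulary
  assert(id ~= nil, "nil index for word: " .. word)
  ids[i] = id
end
local encoderInputs = torch.Tensor(ids)
local encoderOutput = encoder:forward(encoderInputs)
```

The assert is just a diagnostic: if it fires, the nil is in the data pipeline rather than in clnn itself.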
~~I tried to locate the error and I think it's on line 88 in train.lua:~~
model:train(encInputs, decInputs, decTargets)
~~Could it be that #29 introduced this? It's the latest change on this line. Previously it was:~~
local err = model:train(input, target)
~~I'll try to fix this somehow (I don't even know Lua) and get back here then.~~
UPDATE: I checked out the last commit before the merge and I got the same error again. Only the hex numbers differ:
lerk@blrg:~/workspace/neuralconvo$ th train.lua --opencl
-- Loading dataset
data/vocab.t7 not found
-- Parsing Cornell movie dialogs data set ...
[=============================================================== 387810/387810 =======>] Tot: 1s391ms | Step: 0ms
-- Pre-processing data
[============================================================= 166194/166194 =========>] Tot: 5m14s | Step: 0ms
-- Removing low frequency words
[============================================================ 221282/221282 ==========>] Tot: 7m6s | Step: 1ms
Writing data/examples.t7 ...
[============================================================ 221282/221282 ==========>] Tot: 7m4s | Step: 5ms
Writing data/vocab.t7 ...
Dataset stats:
Vocabulary size: 35147
Examples: 221282
libthclnn_searchpath /home/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce GTX 660
-- Epoch 1 / 50
/home/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
[C]: at 0x7fcd865b3aa0
[C]: in function '__newindex'
...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
/home/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./seq2seq.lua:66: in function 'train'
train.lua:88: in main chunk
[C]: in function 'dofile'
...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405e90
I am hitting this as well. I think something changed in the last month or so: the cltorch and clnn modules are no longer supported via luarocks. Instead you have to use the torch-cl distro.
The problem is coming from train.lua:70, in model:getParameters(). That no longer returns the parameters. I'm still looking into it.
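A quick sanity check for that claim might look like the following sketch (model here stands in for whatever seq2seq model train.lua builds):

```lua
-- Hedged diagnostic sketch: nn modules' getParameters() should return a
-- flattened parameter tensor and a matching gradient tensor. If either
-- comes back nil or empty, the flattening is broken in this install.
local params, gradParams = model:getParameters()
if params == nil or params:nElement() == 0 then
  print("getParameters() returned nothing usable")
else
  print("params:", params:nElement(), "gradParams:", gradParams:nElement())
end
```

If this prints a nonzero element count, the problem is elsewhere (e.g. in the optimizer call rather than parameter flattening).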
@mgomes you got anything so far?
I tried this again and stumbled upon the following part in the official torch source:
function optim.adam(opfunc, x, config, state)
-- (0) get/update state
local config = config or {}
local state = state or config
local lr = config.learningRate or 0.001
local lrd = config.learningRateDecay or 0
local beta1 = config.beta1 or 0.9
local beta2 = config.beta2 or 0.999
local epsilon = config.epsilon or 1e-8
In the stack trace, the epsilon assignment is reported as a nil value where a number is expected. I assume that in cltorch (distro-cl) there is no default value for this, but I am unable to find the corresponding file in cltorch.
The config object that gets passed to the function above is the following:
{
momentum : 0.9
learningRate : 0.001
}
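If the nil really does come from a missing adam default, one untested workaround sketch would be to pass every hyperparameter explicitly, so optim.adam never has to rely on an `or`-default (feval and params are placeholders for the closure and parameter tensor train.lua already builds):

```lua
-- Hedged workaround sketch: spell out all adam hyperparameters so no
-- default value inside optim.adam is ever consulted.
local optimConfig = {
  learningRate = 0.001,
  learningRateDecay = 0,
  beta1 = 0.9,
  beta2 = 0.999,
  epsilon = 1e-8,
}
optim.adam(feval, params, optimConfig)  -- feval/params as in train.lua
```

Note the config printed above only contains momentum and learningRate, and momentum is not an adam parameter at all, which may also be worth a look.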
Here's another stacktrace:
> $ th train.lua --opencl [±master ●]
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...
Dataset stats:
Vocabulary size: 35147
Examples: 221282
libthclnn_searchpath /Users/lfuelling/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: Iris Pro
-- Epoch 1 / 50 (LR= 0.001)
{
momentum : 0.9
learningRate : 0.001
}
/Users/lfuelling/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
[C]: at 0x01deef20
[C]: in function '__newindex'
...ling/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
...uelling/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
train.lua:93: in function 'opfunc'
.../lfuelling/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
train.lua:129: in main chunk
[C]: in function 'dofile'
...g/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x0101aa6cf0
UPDATE: I'm stupid. If you read the stack trace, you'll notice Using OpenCL device: Iris Pro. I bet it works when I use the external GPU. neural-style has an option to set the GPU you want; I'll try to implement this.
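cltorch does expose device selection, so a minimal sketch of forcing the discrete GPU before the model is built could look like this (the device index 2 is a guess for this machine, not a known value):

```lua
require 'cltorch'

-- Hedged sketch: list available OpenCL devices, then pick one explicitly.
-- Device ids are 1-based; which id is the discrete GPU varies per machine.
print("OpenCL devices available:", cltorch.getDeviceCount())
cltorch.setDevice(2)  -- hypothetical: 2 = external/discrete GPU here
```

Something like this would need to run before any Cl tensors or modules are created, otherwise they stay on the previously selected device.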
OK, I fixed a bunch of bugs yesterday. I think the easiest thing to do will be to simply reinstall distro-cl, since there were a bunch of fixes. Specifically, rnn is pinned now via rocks-cl, which implies a change to your torch-cl/install/etc/luarocks/config.lua file to have one additional rocks_server, as follows:
rocks_servers = {
[[https://raw.githubusercontent.com/hughperkins/rocks-cl/master]],
[[https://raw.githubusercontent.com/torch/rocks/master]],
[[https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master]]
}
There was also a change to the exe/luajit-rocks submodule, pointing it at https://github.com/hughperkins/luajit-rocks , to hold this configuration. I just now tested a full fresh reinstallation, using the following commands:
git clone --recursive https://github.com/hughperkins/distro-cl torch-cl
cd torch-cl
bash install.sh -b
source /data/torch-cl/install/bin/torch-activate # normally this would be ~/torch-cl/... for you
luarocks install rnn
luarocks install torchx
cd /data/git/neuralconvo
bfboost client -r th train.lua --opencl # you won't need/want the `bfboost client -r` bit; this is just because I'm running on bfboost
# et voila, running, see screenshot
Screenshot:
For those too lazy to read the file: -b doesn't prompt for anything. Check your .whateverrc after the install to remove duplicate entries of torch-activate.
It's not working yet... I'm still trying to fix it. I got as far as maskedSelect being implemented, but it currently causes a segfault under the present scenario, which I need to look into. I think you might as well leave this open for now, really?
I think it was automatically closed. Ping @macournoyer
Ooops! Autoclosed indeed.
Might be working now. Can you pull down the latest updates to distro-cl, and retry?