
opencl training fail

Open SolarPeng opened this issue 8 years ago • 10 comments

I have never been able to train successfully.

th train.lua --opencl --dataset 50000 --hiddenSize 1000

-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
Vocabulary size: 25931
Examples: 83632
libthclnn_searchpath /Users/SolarKing/Dev/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: GeForce 9400M

-- Epoch 1 / 50

/Users/SolarKing/Dev/torch/install/bin/luajit: ...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:67: In 1 module of nn.Sequential:
bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x0ebe4500
    [C]: in function '__newindex'
    .../Dev/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function <.../Dev/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:99>
    [C]: in function 'xpcall'
    ...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    ...arKing/Dev/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:71: in function 'train'
    train.lua:85: in main chunk
    [C]: in function 'dofile'
    .../Dev/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010e8bbbb0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    ...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    ...arKing/Dev/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:71: in function 'train'
    train.lua:85: in main chunk
    [C]: in function 'dofile'
    .../Dev/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010e8bbbb0
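
For anyone wondering why two traces are printed: the output above goes through xpcall and a function called rethrowErrors, so the first trace is captured where the error happens and the second comes from re-raising it. Here is a minimal plain-Lua sketch of that pattern. It is only an illustration, not the actual nn/Container.lua code, and the names failing and rethrow are made up for the example.

```lua
-- Illustration only: 'failing' and 'rethrow' are hypothetical names,
-- not functions from nn or clnn.
local function failing()
  local t = {}
  return t.missing + 1 -- arithmetic on nil raises an error here
end

local function rethrow(f)
  -- The message handler runs at the point of failure, so the traceback
  -- it builds still contains the frame that actually raised the error.
  local ok, trace = xpcall(f, function(err)
    return debug.traceback(err, 2)
  end)
  if not ok then
    -- Re-raising starts a second, less useful traceback from this frame.
    error("In wrapped call:\n" .. trace, 2)
  end
  return trace
end

local ok, message = pcall(rethrow, failing)
print(ok)
print(message)
```

The trace embedded in the message is the one to read, which is what the WARNING line is saying.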

SolarPeng avatar Jun 01 '16 08:06 SolarPeng

I'm also using torch-cl. According to the tutorial there, you shouldn't install nn, cudnn, cldnn, etc., because doing so breaks the installation. The only things I installed were rnn and penlight.

Got something similar:

lerk@blrg:~/workspace/neuralconvo$ th train.lua --opencl
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /home/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce GTX 660

-- Epoch 1 / 50

/home/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x7f142d00baa0
    [C]: in function '__newindex'
    ...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    /home/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:66: in function 'train'
    train.lua:88: in main chunk
    [C]: in function 'dofile'
    ...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405e90

UPDATE: I also tried this on my MacBook Pro, same error:

lerk@blackreach ~/workspace/neuralconvo [14:33:45]
> $ th train.lua --opencl [±master ✓]
-- Loading dataset
data/vocab.t7 not found
-- Parsing Cornell movie dialogs data set ...
 [=============================================================== 387810/387810 =======>] Tot: 1s615ms | Step: 0ms
-- Pre-processing data
 [================================================================ 166194/166194 ======>] Tot: 31s885ms | Step: 0ms
-- Removing low frequency words
 [================================================================ 221282/221282 ======>] Tot: 14s809ms | Step: 0ms
Writing data/examples.t7 ...
 [=============================================================== 221282/221282 =======>] Tot: 33s43ms | Step: 0ms
Writing data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /Users/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: ATI Radeon HD 6770M

-- Epoch 1 / 50

/Users/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x05350f40
    [C]: in function '__newindex'
    ...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    ...rs/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:66: in function 'train'
    train.lua:88: in main chunk
    [C]: in function 'dofile'
    ...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x0105008d00

lfuelling avatar Jun 20 '16 11:06 lfuelling

The stack trace suggests that the error is on this line:

local encoderOutput = self.encoder:forward(encoderInputs)

~~I tried to locate the error and I think it's on line 88 in train.lua:~~

model:train(encInputs, decInputs, decTargets)

~~Could it be that #29 introduced this? It's the latest change on this line. Previously it was:~~

local err = model:train(input, target)

~~I'll try to fix this somehow (I don't even know Lua) and get back here then.~~

UPDATE: I checked out the last commit before the merge and I got the same error again. Only the hex numbers differ:

lerk@blrg:~/workspace/neuralconvo$ th train.lua --opencl
-- Loading dataset
data/vocab.t7 not found
-- Parsing Cornell movie dialogs data set ...
 [=============================================================== 387810/387810 =======>] Tot: 1s391ms | Step: 0ms
-- Pre-processing data
 [============================================================= 166194/166194 =========>] Tot: 5m14s | Step: 0ms
-- Removing low frequency words
 [============================================================ 221282/221282 ==========>] Tot: 7m6s | Step: 1ms
Writing data/examples.t7 ...
 [============================================================ 221282/221282 ==========>] Tot: 7m4s | Step: 5ms
Writing data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /home/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce GTX 660

-- Epoch 1 / 50

/home/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x7fcd865b3aa0
    [C]: in function '__newindex'
    ...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    /home/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:66: in function 'train'
    train.lua:88: in main chunk
    [C]: in function 'dofile'
    ...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405e90

lfuelling avatar Jun 20 '16 15:06 lfuelling

I am hitting this as well. I think something changed in the last month or so where the cltorch and clnn modules are no longer supported via luarocks. Instead you have to use the torch-cl distro.

The problem is coming from train.lua:70 in model:getParameters(). That no longer returns the parameters. I'm still looking into it.
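
In case it helps anyone else narrow this down, here is a rough sketch of a fail-fast check around getParameters(). It assumes the call returns two values (parameters and gradients). The helper name checkedParameters and the two stand-in models are hypothetical and exist only so the snippet runs without Torch installed.

```lua
-- Hypothetical helper: stop early if getParameters() yields nothing,
-- instead of failing later inside the optimizer or a forward pass.
local function checkedParameters(model)
  local params, gradParams = model:getParameters()
  if params == nil or gradParams == nil then
    error("model:getParameters() returned nil; check the torch-cl install", 2)
  end
  return params, gradParams
end

-- Stand-in models (assumptions for illustration, not real nn modules).
local brokenModel = { getParameters = function() return nil end }
local workingModel = { getParameters = function() return { 0.1 }, { 0.0 } end }

local ok = pcall(checkedParameters, brokenModel)
print("broken model passes check:", ok)

local params = checkedParameters(workingModel)
print("working model parameter count:", #params)
```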

mgomes avatar Jul 07 '16 14:07 mgomes

@mgomes you got anything so far?

I tried this again and stumbled upon the following part in the official torch source:

function optim.adam(opfunc, x, config, state)
   -- (0) get/update state
   local config = config or {}
   local state = state or config
   local lr = config.learningRate or 0.001
   local lrd = config.learningRateDecay or 0

   local beta1 = config.beta1 or 0.9
   local beta2 = config.beta2 or 0.999
   local epsilon = config.epsilon or 1e-8

The stack trace reports a nil value where a number was expected, and I suspect the epsilon assignment. I assume that cltorch (distro-cl) has no default value for it, but I am unable to find the corresponding file in cltorch.

The config object that gets passed to the function above is the following:

{
  momentum : 0.9
  learningRate : 0.001
}
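
To sanity-check the defaults idea, here is a small plain-Lua sketch of the `config.x or default` idiom from the excerpt above, fed with the same config table. It is not the optim source, and resolveAdamDefaults is a hypothetical name used only for this illustration.

```lua
-- Hypothetical helper mirroring the `x = config.x or default` lines
-- quoted above; this is not the optim.adam implementation.
local function resolveAdamDefaults(config)
  config = config or {}
  return {
    learningRate = config.learningRate or 0.001,
    learningRateDecay = config.learningRateDecay or 0,
    beta1 = config.beta1 or 0.9,
    beta2 = config.beta2 or 0.999,
    epsilon = config.epsilon or 1e-8,
  }
end

-- Same keys as the config object printed above.
local resolved = resolveAdamDefaults({ momentum = 0.9, learningRate = 0.001 })
print(resolved.epsilon ~= nil)
```

If the installed adam.lua matches the excerpt, a missing epsilon key would fall back to the default rather than stay nil, so it may be worth checking whether the nil comes from somewhere else in the trace.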

Here's another stacktrace:

> $ th train.lua --opencl [±master ●]
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /Users/lfuelling/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: Iris Pro

-- Epoch 1 / 50  (LR= 0.001)

{
  momentum : 0.9
  learningRate : 0.001
}
/Users/lfuelling/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x01deef20
    [C]: in function '__newindex'
    ...ling/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    ...uelling/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    train.lua:93: in function 'opfunc'
    .../lfuelling/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
    train.lua:129: in main chunk
    [C]: in function 'dofile'
    ...g/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x0101aa6cf0

UPDATE: I'm stupid. If you read the stacktrace, you'll notice Using OpenCL device: Iris Pro. I bet it works when I use the external GPU. neural-style has an option to set the GPU you want. I'll try to implement this.
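
A rough sketch of what such an option could look like. The `--device` flag and the parseDevice helper are hypothetical (neuralconvo has no such option as far as this thread shows), and I am assuming cltorch exposes getDeviceCount() and setDevice() as its README describes. The cltorch part is guarded so the snippet still runs where cltorch is not installed.

```lua
-- Hypothetical `--device N` flag; not an existing neuralconvo option.
local function parseDevice(args, default)
  for i = 1, #args do
    if args[i] == "--device" then
      local n = tonumber(args[i + 1])
      if not n or n < 1 or n % 1 ~= 0 then
        error("--device expects a positive integer")
      end
      return n
    end
  end
  return default
end

local device = parseDevice({ "--opencl", "--device", "2" }, 1)
print("requested device:", device)

-- Only touch cltorch when it is actually installed (assumed API:
-- cltorch.getDeviceCount and cltorch.setDevice, 1-based indices).
local ok, cltorch = pcall(require, "cltorch")
if ok and device <= cltorch.getDeviceCount() then
  cltorch.setDevice(device)
end
```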

lfuelling avatar Aug 18 '16 09:08 lfuelling

OK, I fixed a bunch of bugs yesterday. I think the easiest thing to do is simply to reinstall distro-cl, since there were a number of fixes. Specifically, rnn is now pinned via rocks-cl, which requires a change to your torch-cl/install/etc/luarocks/config.lua file to add one additional entry to rocks_servers, as follows:

rocks_servers = {
   [[https://raw.githubusercontent.com/hughperkins/rocks-cl/master]],
   [[https://raw.githubusercontent.com/torch/rocks/master]],
   [[https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master]]
}

There was also a change to the exe/luajit-rocks submodule, to point to https://github.com/hughperkins/luajit-rocks , to hold this configuration.

I just now tested a full fresh reinstallation, using the following commands:

git clone --recursive https://github.com/hughperkins/distro-cl torch-cl
cd torch-cl
bash install.sh -b
source /data/torch-cl/install/bin/torch-activate  # normally this would be ~/torch-cl/... for you
luarocks install rnn
luarocks install torchx
cd /data/git/neuralconvo
bfboost client -r th train.lua --opencl   # you won't need/want the `bfboost client -r` bit, this is just because I'm running on bfboost
# et voila, running, see screenshot

Screenshot: neuralconvo3

hughperkins avatar Aug 21 '16 12:08 hughperkins

For those too lazy to read the file: -b doesn't prompt for anything. Check your .whateverrc after the install and remove any duplicate torch-activate entries.

lfuelling avatar Aug 22 '16 17:08 lfuelling

It's not working yet... I'm still trying to fix it. I got as far as getting maskedSelect implemented, but it currently causes a segfault in this scenario, which I need to look into. I think you might as well leave this open for now?

hughperkins avatar Aug 23 '16 12:08 hughperkins

I think it was automatically closed. Ping @macournoyer

lfuelling avatar Aug 23 '16 13:08 lfuelling

Ooops! Autoclosed indeed.

macournoyer avatar Aug 23 '16 13:08 macournoyer

Might be working now. Can you pull down latest updates to distro-cl, and retry?

hughperkins avatar Aug 25 '16 13:08 hughperkins