Issues for RelationRNN training, maxSeqLen and zero loss or infinity loss
When using the default maxSeqLen, one gets a cublas runtime error:
➜ RelationRNN git:(master) ✗ th train_rel_rnn.lua
[INFO - 2018_05_02_20:11:11] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:11] - "SeqRankingLoader Configurations:"
[INFO - 2018_05_02_20:11:11] - " number of batch : 296"
[INFO - 2018_05_02_20:11:11] - " data batch size : 256"
[INFO - 2018_05_02_20:11:11] - " neg sample size : 1024"
[INFO - 2018_05_02_20:11:11] - " neg sample range: 7524"
[INFO - 2018_05_02_20:11:11] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:11] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:11:11] - " inputSize : 300"
[INFO - 2018_05_02_20:11:11] - " hiddenSize : 256"
[INFO - 2018_05_02_20:11:11] - " maxSeqLen : 40"
[INFO - 2018_05_02_20:11:11] - " maxBatch : 256"
[INFO - 2018_05_02_20:11:11] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:11] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:11:11] - " inputSize : 512"
[INFO - 2018_05_02_20:11:11] - " hiddenSize : 256"
[INFO - 2018_05_02_20:11:11] - " maxSeqLen : 40"
[INFO - 2018_05_02_20:11:11] - " maxBatch : 256"
/home/vimos/.torch/install/bin/luajit: /home/vimos/.torch/install/share/lua/5.1/nn/Container.lua:67:
In 5 module of nn.Sequential:
/home/vimos/Data/git/QA/CFO/src/model/BiGRU.lua:241: cublas runtime error : an internal operation failed at /home/vimos/.torch/extra/cutorch/lib/THC/THCBlas.cu:246
stack traceback:
[C]: in function 'mm'
/home/vimos/Data/git/QA/CFO/src/model/BiGRU.lua:241: in function 'updateGradInput'
/home/vimos/.torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/vimos/.torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/vimos/.torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/vimos/.torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
train_rel_rnn.lua:174: in main chunk
[C]: in function 'dofile'
...mos/.torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x559ae9bad710
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/vimos/.torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/vimos/.torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
train_rel_rnn.lua:174: in main chunk
[C]: in function 'dofile'
...mos/.torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x559ae9bad710
THCudaCheckWarn FAIL file=/home/vimos/.torch/extra/cutorch/lib/THC/THCStream.cpp line=50 error=77 : an illegal memory access was encountered
THCudaCheckWarn FAIL file=/home/vimos/.torch/extra/cutorch/lib/THC/THCStream.cpp line=50 error=77 : an illegal memory access was encountered
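The illegal memory access is consistent with some training sequences being longer than the pre-allocated maxSeqLen buffer, so the backward pass indexes past it. A quick sanity check (a hypothetical sketch, not code from this repo) is to compute the longest tokenized sequence in the data before choosing maxSeqLen:

```python
# Hypothetical helper: find the longest tokenized sequence so that
# maxSeqLen can be set at least that large before training.

def longest_sequence(token_lists):
    """Return the length of the longest tokenized sequence."""
    return max(len(tokens) for tokens in token_lists)

# Example with made-up tokenized questions:
questions = [
    "who wrote the book".split(),
    "what is the capital of the country next to it".split(),
]
print(longest_sequence(questions))  # prints 10
```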
This error can be avoided by using a larger maxSeqLen:
➜ RelationRNN git:(master) ✗ th train_rel_rnn.lua -maxSeqLen 42
[INFO - 2018_05_02_20:11:52] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:52] - "SeqRankingLoader Configurations:"
[INFO - 2018_05_02_20:11:52] - " number of batch : 296"
[INFO - 2018_05_02_20:11:52] - " data batch size : 256"
[INFO - 2018_05_02_20:11:52] - " neg sample size : 1024"
[INFO - 2018_05_02_20:11:52] - " neg sample range: 7524"
[INFO - 2018_05_02_20:11:52] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:52] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:11:52] - " inputSize : 300"
[INFO - 2018_05_02_20:11:52] - " hiddenSize : 256"
[INFO - 2018_05_02_20:11:52] - " maxSeqLen : 42"
[INFO - 2018_05_02_20:11:52] - " maxBatch : 256"
[INFO - 2018_05_02_20:11:52] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:52] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:11:52] - " inputSize : 512"
[INFO - 2018_05_02_20:11:52] - " hiddenSize : 256"
[INFO - 2018_05_02_20:11:52] - " maxSeqLen : 42"
[INFO - 2018_05_02_20:11:52] - " maxBatch : 256"
[INFO - 2018_05_02_20:11:56] - "iter 100, loss = 0.00198258"........] ETA: 3h29m | Step: 42ms
[INFO - 2018_05_02_20:12:00] - "iter 200, loss = 0.00000000"........] ETA: 3h25m | Step: 41ms
[INFO - 2018_05_02_20:12:04] - "epoch 1, loss 0.00066979"..........] ETA: 3h28m | Step: 42ms
[INFO - 2018_05_02_20:12:04] - "iter 300, loss = 0.00000000"........] ETA: 3h27m | Step: 42ms
[INFO - 2018_05_02_20:12:09] - "iter 400, loss = 0.00000000"........] ETA: 3h28m | Step: 42ms
[INFO - 2018_05_02_20:12:13] - "iter 500, loss = 0.00000000"........] ETA: 3h25m | Step: 41ms
[INFO - 2018_05_02_20:12:17] - "epoch 2, loss 0.00000000"..........] ETA: 3h26m | Step: 41ms
[INFO - 2018_05_02_20:12:17] - "iter 600, loss = 0.00000000"........] ETA: 3h26m | Step: 41ms
[INFO - 2018_05_02_20:12:21] - "iter 700, loss = 0.00000000"........] ETA: 3h28m | Step: 42ms
[INFO - 2018_05_02_20:12:25] - "iter 800, loss = 0.00000000"........] ETA: 3h27m | Step: 42ms
[INFO - 2018_05_02_20:12:29] - "epoch 3, loss 0.00000000"..........] ETA: 3h25m | Step: 41ms
[INFO - 2018_05_02_20:12:30] - "iter 900, loss = 0.00000000"........] ETA: 3h25m | Step: 41ms
[INFO - 2018_05_02_20:12:34] - "iter 1000, loss = 0.00000000"........] ETA: 3h27m | Step: 42ms
However, the loss then drops to 0 after the first epoch, or blows up to infinity with a different seed:
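For context on the flat zeros: a margin-based ranking loss (assumed here; CFO's exact criterion may differ) is exactly 0 once every positive relation outscores every negative by the margin, so a constant 0.00000000 can mean either genuine convergence or degenerate scores. A minimal sketch:

```python
# Sketch of a margin ranking loss (an assumption for illustration).
# Loss is 0 exactly when the positive score beats every negative
# score by at least `margin`.

def margin_ranking_loss(pos_score, neg_scores, margin=0.5):
    """Average hinge loss of the positive score against each negative."""
    return sum(max(0.0, margin - pos_score + n) for n in neg_scores) / len(neg_scores)

print(margin_ranking_loss(2.0, [0.1, 0.2]))  # margin satisfied -> 0.0
print(margin_ranking_loss(0.0, [0.1, 0.2]))  # violated -> 0.65
```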
➜ RelationRNN git:(master) ✗ th train_rel_rnn.lua -maxSeqLen 42 -seed 12
[INFO - 2018_05_02_20:26:49] - "--------------------------------------------------"
[INFO - 2018_05_02_20:26:49] - "SeqRankingLoader Configurations:"
[INFO - 2018_05_02_20:26:49] - " number of batch : 296"
[INFO - 2018_05_02_20:26:49] - " data batch size : 256"
[INFO - 2018_05_02_20:26:49] - " neg sample size : 1024"
[INFO - 2018_05_02_20:26:49] - " neg sample range: 7524"
[INFO - 2018_05_02_20:26:49] - "--------------------------------------------------"
[INFO - 2018_05_02_20:26:49] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:26:49] - " inputSize : 300"
[INFO - 2018_05_02_20:26:49] - " hiddenSize : 256"
[INFO - 2018_05_02_20:26:49] - " maxSeqLen : 42"
[INFO - 2018_05_02_20:26:49] - " maxBatch : 256"
[INFO - 2018_05_02_20:26:49] - "--------------------------------------------------"
[INFO - 2018_05_02_20:26:49] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:26:49] - " inputSize : 512"
[INFO - 2018_05_02_20:26:49] - " hiddenSize : 256"
[INFO - 2018_05_02_20:26:49] - " maxSeqLen : 42"
[INFO - 2018_05_02_20:26:49] - " maxBatch : 256"
[INFO - 2018_05_02_20:26:53] - "iter 100, loss = 81231552070126006809284050944.00000000" 41ms
[INFO - 2018_05_02_20:26:57] - "iter 200, loss = 0.00000000"........] ETA: 3h15m | Step: 39ms
[INFO - 2018_05_02_20:27:01] - "epoch 1, loss 27443091915583111597203128320.00000000"p: 40ms
[INFO - 2018_05_02_20:27:01] - "iter 300, loss = 0.00000000"........] ETA: 3h17m | Step: 40ms
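The astronomically large loss values suggest exploding gradients or scores. One common mitigation (hypothetical here; I have not checked whether train_rel_rnn.lua already does this) is to rescale the gradient whenever its norm exceeds a cap:

```python
# Gradient-norm clipping sketch (illustrative; not code from this repo).
import math

def clip_gradient(grad, max_norm=5.0):
    """Rescale a flat gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        grad = [g * scale for g in grad]
    return grad

print(clip_gradient([30.0, 40.0]))  # norm 50 -> rescaled to [3.0, 4.0]
```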
I have not run the code, but I would like to know how many usable examples (those whose subject mention can be found in the question) remain from the train (75910) and test (21678) sets after preprocessing. Would you mind helping with this? I would appreciate it.