An error training an Encoder-Decoder Attention Model
When I train an Encoder-Decoder Attention Model using "sh run_std.sh", I get the following error:
/home/qiang/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [56,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/home/qiang/torch/extra/cutorch/lib/THC/generic/THCStorage.c line=32 error=59 : device-side assert triggered
/home/qiang/torch/install/bin/luajit: cuda runtime error (59) : device-side assert triggered at /home/qiang/torch/extra/cutorch/lib/THC/generic/THCStorage.c:32
stack traceback:
[C]: at 0x7fbc8f5b6050
[C]: in function '__index'
layers/EMaskedClassNLLCriterion.lua:18: in function 'forward'
nnets/EncDecAWE.lua:391: in function 'opfunc'
/home/qiang/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'optimMethod'
nnets/EncDecAWE.lua:468: in function 'trainBatch'
train.lua:40: in function 'train'
train.lua:162: in function 'main'
train.lua:269: in function 'main'
train.lua:272: in main chunk
[C]: in function 'dofile'
...iang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405e90
Lock freed
Usage instructions:
To obtain and lock an id: ./gpu_lock.py --id
The lock is automatically freed when the parent terminates
To get an id that won't be freed: ./gpu_lock.py --id-to-hog
You must manually free these ids: ./gpu_lock.py --free
More info: http://homepages.inf.ed.ac.uk/imurray2/code/gpu_monitoring/
If you switch to CPU mode, you can see more clearly where the error comes from. One of the bugs I ran into may be because the author used an older version of Torch. I fixed it by replacing float with double.
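For anyone debugging this: the assertion `srcIndex < srcSelectDimSize` means that some index passed to an index-select operation is larger than the dimension it selects from. A minimal sketch of the kind of range check that can narrow this down is below. It is plain Python rather than Torch code, and the function name, the example ids, and the vocabulary size are all hypothetical. The assumed valid range (1 to `size`, as in Lua's 1-based indexing) is also an assumption; a masked criterion may reserve a value such as 0 for padding, so adjust the bounds to match your data.

```python
def find_out_of_range(indices, size, one_based=True):
    """Return (position, value) pairs for indices outside the valid range.

    Positions are 0-based Python list positions. With one_based=True the
    valid values are 1..size (assumed Lua/Torch7 convention); otherwise
    they are 0..size-1.
    """
    low = 1 if one_based else 0
    high = size if one_based else size - 1
    return [(pos, idx) for pos, idx in enumerate(indices)
            if not (low <= idx <= high)]


# Hypothetical example: target word ids for one batch, vocabulary of 5 entries.
targets = [1, 4, 5, 6, 0]
vocab_size = 5

bad = find_out_of_range(targets, vocab_size)
print(bad)  # [(3, 6), (4, 0)]
```

Running the equivalent check on the targets just before the criterion's forward call should show whether the data contains ids outside the range the model was built for.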
Hi @qiang2100! I am encountering the same error. Did you find out what is causing it?