docker-torch-rnn
'THCudaCheck FAIL' Using Cuda7.5 Docker Image
After installing the NVIDIA Docker image and launching the Torch RNN container via:

```
nvidia-docker run --rm -ti crisbal/torch-rnn:cuda7.5 bash
```
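(If your training corpus lives on the host, you can also mount it into the container instead of copying it in. A minimal sketch, assuming the host data sits in `~/datasets`; adjust both paths as needed:

```bash
# Mount a host directory into the container so preprocess.py and train.lua can see it
nvidia-docker run --rm -ti \
    -v ~/datasets:/root/torch-rnn/data \
    crisbal/torch-rnn:cuda7.5 bash
```
)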
and preprocessing via:

```
root@3da15ad69af8:~/torch-rnn# python scripts/preprocess.py --input_txt data/library.txt --output_h5 data/library.h5 --output_json data/library.json
```
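If preprocessing succeeds, the two output files named above should now exist; a quick sanity check is just a directory listing:

```bash
# library.h5 holds the encoded corpus, library.json the vocabulary mapping
ls -lh data/library.h5 data/library.json
```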
Attempting to train the system results in the following:
```
root@3da15ad69af8:~/torch-rnn# th train.lua -input_h5 data/library.h5 -input_json data/library.json
Running with CUDA on GPU 0
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9234/cutorch/lib/THC/THCGeneral.c line=608 error=8 : invalid device function
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/nn/Container.lua:67:
In 2 module of nn.Sequential:
./LSTM.lua:128: cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-9234/cutorch/lib/THC/THCGeneral.c:608
stack traceback:
	[C]: in function 'resize'
	./LSTM.lua:128: in function <./LSTM.lua:118>
	[C]: in function 'xpcall'
	/root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	train.lua:130: in function 'opfunc'
	/root/torch/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
	train.lua:187: in main chunk
	[C]: in function 'dofile'
	/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x00406670
WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
stack traceback:
	[C]: in function 'error'
	/root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
	/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	train.lua:130: in function 'opfunc'
	/root/torch/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
	train.lua:187: in main chunk
	[C]: in function 'dofile'
	/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x00406670
```
I think this is an issue that one needs to report to the main torch-rnn repo (https://github.com/jcjohnson/torch-rnn) and not on this one.
First of all, are you sure you are running a CUDA-capable video card?
If yes, let's try something: what happens if you run `nvidia-smi` inside the container? Does it show any relevant info?
@crisbal thanks for the heads up; I will post this to the torch-rnn repo instead. For what it's worth, I do have a GPU installed:
```
root@9be35619d034:~/torch-rnn# nvidia-smi
Mon Jul 11 19:17:26 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
| 28%   41C    P8     7W / 180W |    725MiB /  8113MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
Let me know if in the end it is my fault or theirs :)
One random thought I had: since you have a 1080, maybe it requires a newer version of CUDA that is not yet well supported by either nvidia-docker or Torch.
@crisbal it looks like the issue is that a newer version of CUDA is needed:
https://github.com/jcjohnson/torch-rnn/issues/122
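For anyone who wants to confirm the mismatch from inside the container, the GPU's compute capability can be printed through cutorch. A sketch, assuming `th` and the stock cutorch API are available in the image (a GTX 1080 is compute capability 6.1, which the CUDA 7.5 toolchain cannot generate code for):

```bash
# Inside the container: print the GPU's compute capability as cutorch sees it.
# If cutorch's kernels were not compiled for this capability, CUDA raises
# "invalid device function" at runtime, as in the trace above.
th -e "require 'cutorch'; local p = cutorch.getDeviceProperties(1); print(p.name, p.major .. '.' .. p.minor)"
```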
Do you have any plans to make a CUDA 8 version of the Docker image? Thanks for all the work you've done!
As soon as I get my hands on a CUDA machine and a fast Internet connection, I will. Sorry I can't do it ASAP.
@spadavec I had the same issue and built this today: https://hub.docker.com/r/xoryouyou/torch-rnn/
I got this error today as well, as I'm using a 1080 and have CUDA 8 installed.

@xoryouyou, I tried the command on the page you posted, but I'm getting an error:

```
docker pull xoryouyou/torch-rnn
Using default tag: latest
Error response from daemon: manifest for xoryouyou/torch-rnn:latest not found
```
@HandsomeDevilv112 yeah, the image was only tagged as `1.0` and not `latest`. I updated it.
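Before the `latest` tag existed, pulling the explicitly tagged image would also have worked (the `1.0` tag is the one mentioned above):

```bash
# Pull the image by its explicit tag instead of relying on :latest
docker pull xoryouyou/torch-rnn:1.0
```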
@xoryouyou: Cool! Much obliged. That seems to have done the trick. My apologies if there was a way for me to fix that myself and I just didn't catch it.
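For what it's worth, one way to check which tags a Docker Hub image actually offers is to query the Hub API directly; a sketch, assuming `curl` and `jq` are available:

```bash
# List the tags published for xoryouyou/torch-rnn on Docker Hub
curl -s https://hub.docker.com/v2/repositories/xoryouyou/torch-rnn/tags/ | jq -r '.results[].name'
```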
@xoryouyou Do you think you can share the Dockerfile as well? I want to have a look at how you built your image.
I'm trying to use https://github.com/crisbal/docker-torch-rnn/blob/master/CUDA/8.0/Dockerfile but it does not build. It fails at this section:

```dockerfile
RUN git clone https://github.com/jcjohnson/torch-rnn && \
    pip install -r torch-rnn/requirements.txt
```
@valentinvieriu sorry, I currently don't have access to the machine where I built torch-rnn, but I'll see if I can recreate your issue.
This is the error that pops up when building https://github.com/crisbal/docker-torch-rnn/blob/master/CUDA/8.0/Dockerfile:

```
copying h5py/tests/hl/test_file.py -> build/lib.linux-x86_64-2.7/h5py/tests/hl
running build_ext
Traceback (most recent call last):
  File "
```

As said, it fails at the

```dockerfile
RUN git clone https://github.com/jcjohnson/torch-rnn && \
    pip install -r torch-rnn/requirements.txt
```

section. Any help is appreciated. I'm not very familiar with the dependencies; I only plan to use this as a tool.

Thank you @xoryouyou
OK, for future reference, this fixed the build issue on Ubuntu 16.04. Replace

```dockerfile
RUN git clone https://github.com/jcjohnson/torch-rnn && \
    pip install -r torch-rnn/requirements.txt
```

from https://github.com/crisbal/docker-torch-rnn/blob/master/CUDA/8.0/Dockerfile with:
```dockerfile
# torch-rnn and python requirements
# we use https://github.com/jcjohnson/torch-rnn/blob/master/requirements.txt as a guideline
WORKDIR /root
RUN apt-get install -y cython
RUN pip install --upgrade pip
RUN pip install Cython==0.23.4
RUN pip install numpy==1.10.4
RUN pip install argparse==1.2.1
RUN HDF5_DIR=/usr/lib/x86_64-linux-gnu/hdf5/serial/ pip install h5py==2.5.0
RUN pip install six==1.10.0
RUN git clone https://github.com/jcjohnson/torch-rnn
```
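With that change in place, the image can be built as usual; a minimal sketch, assuming the edited Dockerfile sits in the current directory (the tag name is just an example):

```bash
# Build the patched CUDA 8 image from the directory containing the Dockerfile
docker build -t torch-rnn:cuda8 .
```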
I will work on a Docker image and share it with the rest of you when it's finished.
@valentinvieriu I am currently building with the crisbal/docker-torch-rnn image on Arch and it seems to build fine. Will report when done.

Built on `Linux 4.12.8-2-ARCH #1 SMP PREEMPT Fri Aug 18 14:08:02 UTC 2017 x86_64 GNU/Linux` with Docker version 17.06.0-ce, build 3dfb8343.
build_log.txt