
theano error

Open Abigale001 opened this issue 4 years ago • 2 comments

I ran this command:

$ python run.py /path/to/training_data_file -t /path/to/test_data_file -m 1 5 10 20 -ps loss=bpr-max,final_act=elu-.5,hidden_act=tanh,layers=100,adapt=adagrad,n_epochs=10,batch_size=32,dropout_p_embed=0.0,dropout_p_hidden=0.0,learning_rate=0.2,momentum=0.3,n_sample=2048,sample_alpha=0.0,bpreg=1.0,constrained_embedding=False

But I get this error:

/home/.../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gpuarray/dnn.py:184: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to a version >= v5 and <= v7.
  warnings.warn("Your cuDNN version is more recent than "
ERROR (theano.gpuarray): Could not initialize pygpu, support disabled
Traceback (most recent call last):
  File "/home/.../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gpuarray/__init__.py", line 227, in <module>
    use(config.device)
  File "/home/.../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gpuarray/__init__.py", line 214, in use
    init_dev(device, preallocate=preallocate)
  File "/home/.../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gpuarray/__init__.py", line 117, in init_dev
    context.cudnn_handle = dnn._make_handle(context)
  File "/home/.../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gpuarray/dnn.py", line 130, in _make_handle
    "This can be a sign of a too old driver.", err)
RuntimeError: ('Error creating cudnn handle. This can be a sign of a too old driver.', 1)
SET loss TO bpr-max (type: <class 'str'>)
SET final_act TO elu-0.5 (type: <class 'str'>)
SET hidden_act TO tanh (type: <class 'str'>)
SET layers TO [100] (type: <class 'list'>)
SET adapt TO adagrad (type: <class 'str'>)
SET n_epochs TO 10 (type: <class 'int'>)
SET batch_size TO 32 (type: <class 'int'>)
SET dropout_p_embed TO 0.0 (type: <class 'float'>)
SET dropout_p_hidden TO 0.0 (type: <class 'float'>)
SET learning_rate TO 0.2 (type: <class 'float'>)
SET momentum TO 0.3 (type: <class 'float'>)
SET n_sample TO 2048 (type: <class 'int'>)
SET sample_alpha TO 0.0 (type: <class 'float'>)
SET bpreg TO 1.0 (type: <class 'float'>)
SET constrained_embedding TO False (type: <class 'bool'>)

Loading training data...
Loading data from TAB separated file: examples/rsc15/processed/rsc15_train_tr.txt
Started training
The dataframe is not sorted by SessionId, sorting now
Data is sorted in 46.12
Traceback (most recent call last):
  File "run.py", line 109, in <module>
    gru.fit(data, sample_store=args.sample_store_size, store_type='gpu')
  File "/home/../GRU4Rec/gru4rec.py", line 556, in fit
    generate_samples = theano.function([], updates=updates_st)
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/compile/function.py", line 317, in function
    output_keys=output_keys)
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/compile/pfunc.py", line 486, in pfunc
    output_keys=output_keys)
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/compile/function_module.py", line 1841, in orig_function
    fn = m.create(defaults)
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/compile/function_module.py", line 1715, in create
    input_storage=input_storage_lists, storage_map=storage_map)
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gof/link.py", line 699, in make_thunk
    storage_map=storage_map)[:3]
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gof/vm.py", line 1091, in make_all
    impl=impl))
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gof/op.py", line 955, in make_thunk
    no_recycling)
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gof/op.py", line 858, in make_c_thunk
    output_storage=node_output_storage)
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gof/cc.py", line 1217, in make_thunk
    keep_lock=keep_lock)
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gof/cc.py", line 1157, in compile
    keep_lock=keep_lock)
  File "/home/../anaconda2/envs/gru4rec/lib/python3.6/site-packages/theano/gof/cc.py", line 1641, in cthunk_factory
    *(in_storage + out_storage + orphd))
RuntimeError: ('The following error happened while compiling the node', GpuBinarySearchSorted{context_name=None, dtype_int64=True}(GpuFromHost<None>.0, GpuFromHost<None>.0), '\n', 'GpuKernel_init error 3: nvrtcCompileProgram: NVRTC_ERROR_BUILTIN_OPERATION_FAILURE')

OS: Debian 4.9.110-3+deb9u4~deb8u1 (2018-08-24) x86_64 GNU/Linux
cudnn: 7.6
cuda: 9.2
theano: 1.0.4
pygpu: 0.7.6
libgpuarray: 0.7.6

Could anyone help?

Abigale001, May 15 '20 08:05

This is not a GRU4Rec-related error, but a sign that something in your Theano setup is not correct. It's not even Theano's fault: there is some kind of incompatibility between the driver, cuda and cuDNN on your system.

Some ideas on what could have gone wrong:

  • Maybe your nVidia driver is too old (as the error message suggests). You should have at least r396.26, and newer versions also work (for example, I use r430.14). If your driver is old, try upgrading it. (When upgrading the driver, pay attention to the runfile vs. package issue.)
  • Maybe the cuDNN you use was built for a different cuda version (there are different cuDNN 7.6 libs built for cuda 9.2, cuda 10.0, cuda 10.2, etc.). Double-check that your cuda and cuDNN match.
  • Maybe when you built libgpuarray/pygpu, you did it in a different environment and it links to a different cuda/cuDNN version than you are using currently.
  • It is less likely, but maybe a different cuda version is in your PATH or in LD_LIBRARY_PATH or in CUDA_HOME (or these environment variables don't point to any cuda version).
  • You can also try disabling cuDNN usage in Theano by running it with the appropriate flag: THEANO_FLAGS=dnn.enabled=False python run.py ... I don't think this has a significant impact on the speed of GRU4Rec, but I haven't tried it yet. If setting this flag helps, the issue is definitely some kind of incompatibility between your cuDNN and cuda (and/or nVidia driver).
  • You can also try deleting the Theano cache (theano-cache purge, or manually deleting the contents of ~/.theano/); maybe something is stuck there from a previous version.
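
The first two checks above (driver version and cuDNN version) can be sketched in Python. This is only a rough illustration, not part of GRU4Rec or Theano: the helper names are made up for this example, the 396.26 minimum is the one quoted above, and the cuDNN parser assumes the cuDNN 7 layout where the version macros live in cudnn.h (the location of that header on your system is for you to fill in).

```python
import re
import shutil
import subprocess

MIN_DRIVER = (396, 26)  # minimum driver quoted above for cuda 9.2


def parse_version(text):
    """Turn a dotted version string such as '430.14' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(version_text, minimum=MIN_DRIVER):
    """Compare a driver version string against the minimum, component by component."""
    return parse_version(version_text) >= minimum


def cudnn_version_from_header(header_text):
    """Read CUDNN_MAJOR/MINOR/PATCHLEVEL from the text of cudnn.h (cuDNN 7 layout)."""
    found = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+" + name + r"\s+(\d+)", header_text)
        if match is None:
            return None  # macro missing: not a cuDNN 7 style header
        found.append(int(match.group(1)))
    return tuple(found)


def installed_driver_version():
    """Ask nvidia-smi for the driver version; return None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
    except OSError:
        return None
    if result.returncode != 0 or not result.stdout.strip():
        return None
    # One line per GPU; the driver is the same for all of them.
    return result.stdout.splitlines()[0].strip()


if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("nvidia-smi not available; cannot read the driver version")
    else:
        print("driver", version, "ok" if driver_is_new_enough(version) else "too old")
```

To check cuDNN, pass the contents of your cudnn.h to cudnn_version_from_header and compare the result with the cuDNN build you meant to install. Note that this only reads the version number; it cannot tell which cuda version the library was built for.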

If nothing else works, you can set up the whole environment from the ground up. I usually do it this way, because then I know exactly what was installed. It has worked for me 100% of the time. The main steps are:

  • Uninstall your current cuda and nVidia driver (if these are installed)
  • Install the nVidia driver (>=396.26)
  • Install cuda 9.2
  • Set the CUDA_HOME environment variable to <cuda_install_dir> (default: /usr/local/cuda-9.2), and make sure that <cuda_install_dir>/bin and <cuda_install_dir>/lib64 are added to PATH and LD_LIBRARY_PATH respectively.
  • Download the appropriate cuDNN libs and copy them under cuda/lib64
  • (If you start on a new machine, this is the point where you also install python and the required packages (optionally in pyenv/virtualenv))
  • Install pycuda
  • Build and install libgpuarray & pygpu
  • Build and install Theano
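
The environment variable step above is easy to get wrong, so here is a minimal Python sketch that checks it. The function name check_cuda_env is made up for this example, and it assumes the default install directory mentioned above (/usr/local/cuda-9.2) when CUDA_HOME is missing. It compares paths as plain strings, so it will not see through symlinks such as /usr/local/cuda.

```python
import os


def check_cuda_env(env, cuda_home_default="/usr/local/cuda-9.2"):
    """Return a list of problems found in the cuda-related environment variables."""
    problems = []
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
        cuda_home = cuda_home_default  # fall back so the other checks still run
    cuda_home = cuda_home.rstrip("/")

    def entries(name):
        # Split a colon-separated variable and ignore trailing slashes.
        return [p.rstrip("/") for p in env.get(name, "").split(":") if p]

    if cuda_home + "/bin" not in entries("PATH"):
        problems.append(cuda_home + "/bin is not in PATH")
    if cuda_home + "/lib64" not in entries("LD_LIBRARY_PATH"):
        problems.append(cuda_home + "/lib64 is not in LD_LIBRARY_PATH")
    return problems


if __name__ == "__main__":
    found = check_cuda_env(os.environ)
    if found:
        for problem in found:
            print("problem:", problem)
    else:
        print("CUDA_HOME, PATH and LD_LIBRARY_PATH look consistent")
```

Run it in the same shell (and the same conda/virtualenv) that you use to start run.py, since that is the environment Theano will see.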

hidasib, May 15 '20 10:05

Thank you very much. I will go through your suggestions and check my setup.

Abigale001, May 15 '20 14:05