
【NVC-Net】RuntimeError: target_specific error in backward_impl. Failed `status == CUDNN_STATUS_SUCCESS`: UNKNOWN

Open · Kanraaaaa opened this issue 3 years ago · 1 comment

Hi, I am trying to train NVC-Net on a single GPU, but I get the following errors:

value error in query /home/gitlab-runner/builds/jmdP2aBr/1/nnabla/builders/all/nnabla/include/nbla/function_registry.hpp:69 Failed it != items_.end(): Any of [cudnn:float, cuda:float, cpu:float] could not be found in []

No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.
2022-02-15 17:16:13,887 [nnabla][INFO]: Training data with 100 speakers.
2022-02-15 17:16:13,888 [nnabla][INFO]: DataSource with shuffle(True)
2022-02-15 17:16:13,934 [nnabla][INFO]: Using DataIterator
Running epoch=1 lr=0.00010
Error during backward propagation:
  Add2CudaCudnn
  Add2CudaCudnn
  Add2CudaCudnn
  MulScalarCuda
  MeanCudaCudnn
  SquaredErrorCuda
  Div2Cuda
  PowScalarCuda
  SumCuda
  AddScalarCuda
  PowScalarCuda
  ConvolutionCudaCudnn
  PadCuda
  GELUCuda
  ConvolutionCudaCudnn
  PadCuda
  GELUCuda
  ConvolutionCudaCudnn
  GELUCuda
  Add2CudaCudnn
  ConvolutionCudaCudnn
  Mul2Cuda
  TanhCudaCudnn <-- ERROR
Traceback (most recent call last):
  File "main.py", line 99, in <module>
    run(args)
  File "main.py", line 70, in run
    Trainer(gen, gen_optim, dis, dis_optim, dataloader, rng, hp).run()
  File "11_ai-research-code-master/nvcnet/train.py", line 157, in run
    self.train_on_batch(i)
  File "11_ai-research-code-master/nvcnet/train.py", line 197, in train_on_batch
    p['g_loss'].backward(clear_buffer=True)
  File "_variable.pyx", line 826, in nnabla._variable.Variable.backward
RuntimeError: target_specific error in backward_impl
 /home/gitlab-runner/builds/-phDBBa6/0/nnabla/builders/all/nnabla-ext-cuda/src/nbla/cuda/cudnn/function/./generic/tanh.cu:79
Failed status == CUDNN_STATUS_SUCCESS: UNKNOWN
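The log above stops at the backward pass of TanhCudaCudnn, so it may help to check whether a single tanh fails on its own, outside NVC-Net. The following is only a minimal sketch: the helper name `check_tanh_backward` is made up for illustration, and it assumes the `cudnn` extension context and device `0` used elsewhere in this thread. If nnabla or the extension cannot be loaded, it reports that instead of raising.

```python
def check_tanh_backward(ext_name="cudnn", device_id="0"):
    """Run forward and backward through one tanh; return (ok, message).

    Hypothetical helper for isolating the failure, not part of NVC-Net.
    Note: it changes nnabla's default context as a side effect.
    """
    try:
        import numpy as np
        import nnabla as nn
        import nnabla.functions as F
        from nnabla.ext_utils import get_extension_context
    except ImportError as exc:
        return False, f"import failed: {exc}"

    try:
        # Select the same backend that training uses (-c cudnn -d 0).
        ctx = get_extension_context(ext_name, device_id=device_id)
        nn.set_default_context(ctx)

        data = np.linspace(-2.0, 2.0, 8, dtype=np.float32).reshape(2, 4)
        x = nn.Variable.from_numpy_array(data, need_grad=True)
        x.grad.zero()  # backward accumulates into grad, so start from zero

        loss = F.sum(F.tanh(x))
        loss.forward()
        loss.backward()

        # d/dx sum(tanh(x)) = 1 - tanh(x)^2
        expected = 1.0 - np.tanh(data) ** 2
        ok = bool(np.allclose(x.g, expected, atol=1e-4))
        return ok, ("gradient matches" if ok else "gradient mismatch")
    except Exception as exc:  # e.g. the cuDNN RuntimeError from the log
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    print(check_tanh_backward())
```

If this small check raises the same CUDNN_STATUS_SUCCESS failure, the problem is more likely in the CUDA/cuDNN setup than in the NVC-Net code; if it passes, the cause is probably elsewhere.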

I followed the installation page (https://nnabla.org/install/), but it does not work. Could you please give me some suggestions? My environment is as follows: CUDA 11.0, cuDNN 8.1.0, Python 3.6.8

Thank you! I look forward to your kind reply.
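When reporting an environment like this, it can also be useful to confirm which nnabla modules the Python interpreter can actually see. The sketch below uses only the standard library; the function names `module_available` and `environment_report` are illustrative, and it assumes the CUDA extension is installed under the module names `nnabla_ext.cuda` and `nnabla_ext.cudnn`. It does not detect the CUDA or cuDNN versions themselves.

```python
import importlib.util
import platform


def module_available(name):
    """Return True if `name` can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. nnabla_ext) is missing.
        return False


def environment_report():
    """Collect the Python version and the presence of the nnabla modules."""
    report = {"python": platform.python_version()}
    for name in ("nnabla", "nnabla_ext.cuda", "nnabla_ext.cudnn"):
        report[name] = module_available(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```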

Kanraaaaa · Feb 16 '22 02:02

Thank you for checking. Forward propagation seems to be working, so I hope there is no problem with the nnabla installation itself. If you can use Docker, could you please try running it in Docker?

cd nvcnet
./scripts/docker_build.sh
docker run --gpus all -u $(id -u):$(id -g) -v $HOME:$HOME -w $(pwd) --rm -it nvcnet:latest /bin/bash
export NUMBA_CACHE_DIR=/tmp
python main.py -c cudnn -d 0

If you get the same error, could you please provide your GPU information from nvidia-smi -L?
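For convenience, the GPU list can also be captured from Python and pasted into the issue. This is a rough sketch that assumes `nvidia-smi` is on the PATH; the function name `list_gpus` is illustrative, and it returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def list_gpus():
    """Return the non-empty lines printed by `nvidia-smi -L`, or []."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # driver tools not installed or not on PATH
    try:
        result = subprocess.run(
            [exe, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # text mode, works on Python 3.6 too
        )
    except OSError:
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_gpus()
    print("\n".join(gpus) if gpus else "no GPU information available")
```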

TomonobuTsujikawa · Feb 18 '22 03:02