
cudnnConvolutionBackwardData failed - Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED (cudnnConvolutionBackwardData)

Open ProGamerGov opened this issue 8 years ago • 8 comments

I'm not sure what is causing this error, or how to fix it:

cudnnConvolutionBackwardData failed:    9        convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA1,3,2615,2816 -filtA64,3,3,3 1,64,2615,2816 -padA1,1 -convStrideA1,1 CUDNN_DATA_FLOAT
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 1 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/cudnn/find.lua:94: Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED (cudnnConvolutionBackwardData)
stack traceback:
        [C]: in function 'error'
        /home/ubuntu/torch/install/share/lua/5.1/cudnn/find.lua:94: in function 'checkedCall'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:212: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:201>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:58: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:50>
        [C]: in function 'pcall'
        /home/ubuntu/torch/install/share/lua/5.1/cutorch/init.lua:32: in function 'withDevice'
        /home/ubuntu/torch/install/share/lua/5.1/nn/GPU.lua:112: in function </home/ubuntu/torch/install/share/lua/5.1/nn/GPU.lua:108>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:58: in function 'updateGradInput'
        neural_style.lua:284: in function 'opfunc'
        /home/ubuntu/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
        neural_style.lua:307: in function 'main'
        neural_style.lua:601: in main chunk
        [C]: in function 'dofile'
        ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

ProGamerGov avatar Oct 25 '17 22:10 ProGamerGov

I have been trying to push things as far as they can go, and may have hit a limit in Torch7 and/or cuDNN, because search engines don't really show anything for this error.

I was running the latest version of Torch on Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-1038-aws x86_64), with CUDA 9.0 and cuDNN v7.

ProGamerGov avatar Oct 25 '17 22:10 ProGamerGov

I assume this error is caused by some limit on the maximum size that is supported? If so, can that maximum be changed?

ProGamerGov avatar Oct 30 '17 19:10 ProGamerGov

The error appears to come from these areas:

In SpatialConvolution.lua, on line 201: https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua#L201

In SpatialConvolution.lua, on line 209: https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua#L209

@soumith How do I fix this limitation?

ProGamerGov avatar Oct 31 '17 21:10 ProGamerGov

Related Issues:

https://github.com/jzbontar/mc-cnn/issues/16

https://github.com/allenai/XNOR-Net/issues/22

https://github.com/soumith/dcgan.torch/issues/67

https://github.com/facebook/fb.resnet.torch/issues/153

ProGamerGov avatar Oct 31 '17 21:10 ProGamerGov

After setting cudnn.verbose = true, it seems that this may be an out-of-memory issue after all:

https://gist.github.com/ProGamerGov/9e5b367a90cd4be9cbd1ed023dafbb81

I thought I could go a lot higher in terms of image size in Neural-Style, but I did that on an install with an earlier version of Torch and CUDA/cuDNN. Either Torch7 or CUDA/cuDNN has become less efficient, and that is probably why I can't go any higher in terms of image size: https://github.com/jcjohnson/neural-style/issues/429
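For a rough sense of scale, the tensor shapes in the error message at the top of this issue can be turned into byte counts. This is only a back-of-the-envelope sketch in plain Lua: it assumes the hash lists the input dimensions (1,3,2615,2816) followed by the output dimensions (1,64,2615,2816), and that every element is a 4-byte float as CUDNN_DATA_FLOAT suggests. The helper name tensor_bytes is made up for illustration.

```lua
-- Shapes read from the error message (N, C, H, W); treating the first as the
-- input and the second as the output of the first conv layer is an assumption.
local input  = {1, 3, 2615, 2816}
local output = {1, 64, 2615, 2816}
local BYTES_PER_FLOAT = 4  -- CUDNN_DATA_FLOAT is a 32-bit float

-- Bytes occupied by a dense float tensor of the given shape.
local function tensor_bytes(shape)
  local n = 1
  for _, d in ipairs(shape) do n = n * d end
  return n * BYTES_PER_FLOAT
end

local GiB = 1024 ^ 3
print(string.format("input:  %.2f GiB", tensor_bytes(input) / GiB))
print(string.format("output: %.2f GiB", tensor_bytes(output) / GiB))
```

Under those assumptions the 64-channel tensor alone is about 1.76 GiB, and the backward pass needs a gradient of the same size, before counting the rest of the network or any cuDNN workspace.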

ProGamerGov avatar Oct 31 '17 22:10 ProGamerGov

Try limiting your workspace size by setting cudnn.maxWorkspaceGPUMemPercent (say, to 30 or 40)
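A minimal sketch of that setting, assuming it goes near the top of the script, before the first forward/backward pass (the placement is an assumption, and 30 is just the example value from above):

```lua
require 'cudnn'

-- Limit the cuDNN workspace, expressed as a percentage of GPU memory.
-- 30 is only the example value suggested above; tune it for your GPU.
cudnn.maxWorkspaceGPUMemPercent = 30
```

If the error persists, a lower value or a smaller image size may still be needed.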

ngimel avatar Oct 31 '17 22:10 ngimel

Hi guys, I was wondering if any of you has made any progress on this problem. I have a similar error with cudnnConvolutionBackwardFilter. See below for the full error message:

cudnnConvolutionBackwardFilter failed: 9 convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA93700,3,20,9 -filtA10,3,9,9 93700,10,12,1 -padA0,0 -convStrideA1,1 CUDNN_DATA_FLOAT
/usr/local/mnt/vega_scratch/scratch/bio_vad/src/torch/install/bin/luajit: ...bio_vad/src/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 2 module of nn.Sequential:
...h/bio_vad/src/torch/install/share/lua/5.1/cudnn/find.lua:94: Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED (cudnnConvolutionBackwardFilter)
stack traceback:
        [C]: in function 'error'
        ...h/bio_vad/src/torch/install/share/lua/5.1/cudnn/find.lua:94: in function 'checkedCall'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:264: in function 'accGradParameters'
        ...ch/bio_vad/src/torch/install/share/lua/5.1/nn/Module.lua:32: in function <...ch/bio_vad/src/torch/install/share/lua/5.1/nn/Module.lua:29>
        [C]: in function 'xpcall'
        ...bio_vad/src/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        ...io_vad/src/torch/install/share/lua/5.1/nn/Sequential.lua:87: in function <...io_vad/src/torch/install/share/lua/5.1/nn/Sequential.lua:81>
        [C]: in function 'xpcall'
        ...bio_vad/src/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        ...io_vad/src/torch/install/share/lua/5.1/nn/Sequential.lua:91: in function 'backward'
        ...ai/code/CLVTtorch/CLVT_SSF_Trainer/train_noSequencer.lua:106: in function 'opfunc'
        ...o_vad/src/torch/install/share/lua/5.1/optim/adadelta.lua:31: in function 'optimMethod'
        ...ai/code/CLVTtorch/CLVT_SSF_Trainer/train_noSequencer.lua:212: in main chunk
        [C]: in function 'dofile'
        ...ode/CLVTtorch/CLVT_SSF_Trainer/trainCLVT_noSequencer.lua:124: in main chunk
        [C]: in function 'dofile'
        .../src/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x004064f0

Is this a memory issue?

Cheers

Kevinpsk avatar Dec 08 '17 13:12 Kevinpsk

@ProGamerGov Have you solved this problem?

ChangshiFan avatar Aug 10 '18 06:08 ChangshiFan