caffe2 conv - cudnnAddTensor for bias generates wrong output! (HALF, cuDNN 6.0, in special cases)
cudnnAddTensor writes wrong values to the output blob :(
Specification, in my case:
- data type: float16
- input tensor (NCHW): (8, 256, 28, 28)
- output tensor (NCHW): (8, 512, 28, 28)
- kernel: 1x1 (the bug almost always occurs with 1x1 kernels)

With batch size N = 1-3 it works well, but it produces wrong output when N >= 4.
Tested on a Jetson TX2 and on a server with a GTX 1080 Ti.
It may be due to cuDNN itself, but I am filing this issue as a heads-up.
Might be a downstream problem? cc @slayton58 if you have a TX2/1080Ti on hand to check :)
OK, to clarify - the call to cudnnAddTensor is coming from the bias addition in a convolution operator?
In which case you actually want (all in NCHW): (8, 512, 28, 28) + (1, 512, 1, 1) in fp16?
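For reference, a minimal sketch (an assumed helper, not caffe2's actual code) of that suspected call: broadcasting a (1, 512, 1, 1) fp16 bias onto an (8, 512, 28, 28) fp16 output via cudnnAddTensor. Note that cuDNN takes float scaling factors even for HALF tensors:

```cpp
#include <cudnn.h>

// Sketch of the suspected failing call: broadcast-add a per-channel fp16
// bias onto an fp16 conv output, as the cuDNN conv op does for its bias.
// Error checking omitted for brevity.
void AddBiasFp16(cudnnHandle_t handle, const void* bias, void* output) {
  cudnnTensorDescriptor_t bias_desc, out_desc;
  cudnnCreateTensorDescriptor(&bias_desc);
  cudnnCreateTensorDescriptor(&out_desc);
  // Bias (1, 512, 1, 1) is broadcast along N, H, W onto (8, 512, 28, 28).
  cudnnSetTensor4dDescriptor(bias_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                             1, 512, 1, 1);
  cudnnSetTensor4dDescriptor(out_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                             8, 512, 28, 28);
  // Scaling factors are float even when the tensor data is HALF.
  const float one = 1.0f;
  cudnnAddTensor(handle, &one, bias_desc, bias, &one, out_desc, output);
  cudnnDestroyTensorDescriptor(bias_desc);
  cudnnDestroyTensorDescriptor(out_desc);
}
```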
I solved this problem with my own CUDA implementation of the bias addition, and it works well :) (though it is a little slower than cuDNN).
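The author's replacement kernel isn't shown in the thread; below is a minimal sketch, assuming NCHW layout and a grid-stride loop, of what such an fp16 bias add might look like (the kernel name, launch configuration, and float accumulation are illustrative choices, not the author's code):

```cuda
#include <cuda_fp16.h>

// Illustrative fp16 bias-add kernel (not the author's actual code):
// adds bias[c] to every element of an NCHW tensor with C channels and
// HW = H * W elements per channel plane.
__global__ void BiasAddHalfNCHW(const int total, const int C, const int HW,
                                const half* bias, half* output) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < total;
       i += blockDim.x * gridDim.x) {
    const int c = (i / HW) % C;  // channel of element i
    // Convert to float for the add to avoid extra fp16 rounding.
    output[i] = __float2half(__half2float(output[i]) + __half2float(bias[c]));
  }
}

// Launch for the shapes reported in this issue: N=8, C=512, H=W=28.
void LaunchBiasAdd(const half* bias, half* output, cudaStream_t stream) {
  const int N = 8, C = 512, H = 28, W = 28;
  const int total = N * C * H * W;
  const int threads = 256;
  const int blocks = (total + threads - 1) / threads;
  BiasAddHalfNCHW<<<blocks, threads, 0, stream>>>(total, C, H * W,
                                                  bias, output);
}
```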
@slayton58 Do you recall any changes landing recently that could have solved this?
@pietern Still waiting on more information from @kangdongh; otherwise I can't comment
@slayton58 Sorry for the late response :(. Actually, I'm not used to working with GitHub issues. First, I found a large error in the results across different batch sizes (1 vs. 8) when I tried to test VGG + Tucker decomposition + FP16 with a bigger batch size for a small contest. I then debugged to pinpoint where the problem came from, and found that the error originated in certain convolution operators that have a bias term.
I worked around it by replacing the cudnnAddTensor(...) call in the cuDNN conv operator with my own CUDA code. The underlying issue still remains; to clarify, it occurs when:
- FP16 is used,
- the batch size is bigger than a certain constant value, and
- the cuDNN conv operator is used with a bias term (it then calls cudnnAddTensor).

The output was corrected when I replaced the caffe2 conv code in conv_op_cudnn.cc, lines 610-618, with another implementation (a sketch of one possible replacement is below).
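For illustration, a hedged sketch of how that call-site swap might look; identifiers such as cudnn_wrapper_, bias_desc_, top_desc_, context_, Y, H_out, and W_out are recalled from caffe2's cuDNN conv op of that era and may not match the exact code at those lines:

```cuda
// Hedged sketch of the call-site swap inside the cuDNN conv op (assumed
// identifiers; the exact code at conv_op_cudnn.cc:610-618 differs by
// caffe2 version). Replace the cudnnAddTensor-based bias addition:
//
//   CUDNN_CHECK(cudnnAddTensor(
//       cudnn_wrapper_.inline_cudnn_handle(),
//       cudnnTypeWrapper<T>::kOne(), bias_desc_, bias.template data<T>(),
//       cudnnTypeWrapper<T>::kOne(), top_desc_,
//       Y->template mutable_data<T>()));
//
// with a launch of the hand-written kernel from the earlier sketch:
const int total = Y->size();  // N * C * H_out * W_out
const int threads = 256;
const int blocks = (total + threads - 1) / threads;
BiasAddHalfNCHW<<<blocks, threads, 0, context_.cuda_stream()>>>(
    total, C, H_out * W_out,
    reinterpret_cast<const half*>(bias.template data<T>()),
    reinterpret_cast<half*>(Y->template mutable_data<T>()));
```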