
How to solve the float precision error problem?

Open · leftthomas opened this issue 6 years ago · 2 comments

@szagoruyko Thanks for your repository. I have been studying how to use CuPy to write my own CUDA kernel functions and call them from PyTorch, but I found that when a CUDA kernel contains a for loop of multiply-add ops, it runs into a precision error. I wrote a CUDA kernel for a vector dot product and tested it with vectors of length 512; the assertion `assert (y_fast - y_ref).data.abs().max() < 1e-6` fails. However, your `_conv2d_depthwise_kernel` CUDA kernel also has a for loop with multiply-add ops, and when I changed your `test_modules` function to use `module = Conv2dDepthwise(channels=8, kernel_size=25)` and `x = Variable(torch.randn(1, 8, 256, 320))`, it did not hit the precision problem. I wonder why that happens. My vector dot op and test code are in the repo CuPyLearn; could you help me figure out why? Thanks!

leftthomas · Mar 02 '18 18:03
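For context: the absolute error of a sequentially accumulated float32 dot product grows with vector length, so a fixed tolerance of `1e-6` is hard to meet at length 512 regardless of whether the kernel is correct. A minimal CPU sketch of the effect (illustrative code, not taken from CuPyLearn), comparing a float32 loop against a float64 reference:

```python
import numpy as np

n = 512
rng = np.random.default_rng(0)
a = rng.standard_normal(n).astype(np.float32)
b = rng.standard_normal(n).astype(np.float32)

# Reference: accumulate the same float32 inputs in float64.
y_ref = np.dot(a.astype(np.float64), b.astype(np.float64))

# Sequential float32 accumulation, mirroring a naive for loop in a CUDA kernel.
acc = np.float32(0.0)
for ai, bi in zip(a, b):
    acc = np.float32(acc + ai * bi)

print(abs(float(acc) - y_ref))  # typically around 1e-5 at this length, i.e. > 1e-6
```

Why the depthwise-convolution test behaves differently is not settled in this thread; the sketch above only shows that a failure at `1e-6` for a length-512 float32 dot product is expected floating-point behaviour rather than necessarily a bug in the kernel.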

@leftthomas you probably need to accumulate in double precision

szagoruyko · Mar 03 '18 08:03
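A minimal sketch of this suggestion (the kernel name and launch configuration are illustrative, not from pyinn): keep float inputs and outputs, but carry the running sum in a `double` and cast back to `float` only for the final store.

```python
import cupy as cp

# Hypothetical single-thread dot-product kernel with a double-precision accumulator.
dot_double_acc = cp.RawKernel(r'''
extern "C" __global__
void dot_double_acc(const float* a, const float* b, float* out, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {  // one thread, for clarity only
        double acc = 0.0;                       // accumulate in double precision
        for (int i = 0; i < n; ++i) {
            acc += (double)a[i] * (double)b[i];
        }
        out[0] = (float)acc;                    // one rounding back to float32
    }
}
''', 'dot_double_acc')

n = 512
a = cp.random.randn(n).astype(cp.float32)
b = cp.random.randn(n).astype(cp.float32)
out = cp.zeros(1, dtype=cp.float32)
dot_double_acc((1,), (1,), (a, b, out, cp.int32(n)))
```

This removes the accumulation error entirely, at the cost of slower double-precision arithmetic on most consumer GPUs; note that the final cast still rounds the result once into float32.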

@szagoruyko But I see that your `conv2d_depthwise_kernel` uses a `$(Dtype)` value to accumulate the result, and the `test_modules` function tests with a `FloatTensor`; according to `utils.py`, `$(Dtype)` is replaced with `float` at run time. So I'm confused about why your code passes the test while mine can't. I have also looked into methods for mitigating the precision error, such as Kahan's summation formula; I implemented it inside the CUDA kernel, but it didn't help. Is accumulating the sum in `double` the only way, or the best way, to solve this problem? Furthermore, I just tested the CUDA kernel with a `double` accumulator, and it still fails on the length-512 vector. Here is my test result (screenshot: qq20180303-174334).

leftthomas · Mar 03 '18 09:03
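For completeness, a sketch of the Kahan (compensated) summation mentioned above, written as a CuPy raw kernel (names are illustrative; this is the textbook algorithm, not the code from CuPyLearn):

```python
import cupy as cp

# Hypothetical single-thread dot product with Kahan-compensated float32 accumulation.
kahan_dot = cp.RawKernel(r'''
extern "C" __global__
void kahan_dot(const float* a, const float* b, float* out, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float sum = 0.0f;
        float c = 0.0f;                    // running compensation (lost low-order bits)
        for (int i = 0; i < n; ++i) {
            float y = a[i] * b[i] - c;     // apply the correction from the previous step
            float t = sum + y;             // low-order bits of y are lost here...
            c = (t - sum) - y;             // ...and recovered here
            sum = t;
        }
        out[0] = sum;
    }
}
''', 'kahan_dot')
```

One plausible reason both Kahan summation and the double accumulator still fail the test (not confirmed in the thread): whatever the accumulator, the result must be rounded once when it is stored into a float32 output. A dot product of 512 standard normal entries is typically a few tens in magnitude, and a single float32 rounding at that magnitude is already on the order of 1e-6, so an absolute tolerance of `1e-6` can fail even for a result that is as accurate as a float32 output allows. A relative comparison, e.g. `(y_fast - y_ref).data.abs().max() < 1e-6 * y_ref.data.abs().max()`, avoids this.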