How to solve the float precision error problem?
@szagoruyko Thanks for your repository. I have studied how to use CuPy to write my own CUDA kernel functions and call them from PyTorch, but I found that when a kernel uses a for loop with multiply-add operations, it runs into a floating-point precision problem.
I have written a CUDA kernel for a vector dot product and tested it with vectors of length 512; the assertion `assert (y_fast - y_ref).data.abs().max() < 1e-6` fails. But your `_conv2d_depthwise_kernel` CUDA kernel also has a for loop with multiply-add ops, and when I changed your `test_modules` function to `module = Conv2dDepthwise(channels=8, kernel_size=25)` and `x = Variable(torch.randn(1,8,256,320))`, it did not hit the precision problem. I wonder why that happens? My vector dot op and test code are in the repo CuPyLearn, could you help me figure out why? Thanks!
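For illustration, here is a minimal sketch of the kind of kernel I mean; it is not the exact code in CuPyLearn, and the kernel name, single-thread loop, and indexing are just placeholders:

```cuda
// Naive dot product: one thread loops over the whole vector and
// accumulates in float, so rounding error grows with the vector length.
extern "C" __global__ void vector_dot_naive(const float* x, const float* y,
                                            float* out, int n) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
      sum += x[i] * y[i];   // multiply-add in float
    }
    out[0] = sum;
  }
}
```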
@leftthomas you probably need to accumulate in double precision
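For example, a sketch of what that can look like (not the pyinn kernel itself; the names here are only illustrative) keeps the inputs and output in float and widens only the accumulator:

```cuda
// Same dot product, but the running sum is carried in double precision;
// inputs and output stay in float, only the accumulator is widened.
extern "C" __global__ void vector_dot_double_acc(const float* x, const float* y,
                                                 float* out, int n) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
      sum += (double)x[i] * (double)y[i];
    }
    out[0] = (float)sum;
  }
}
```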
@szagoruyko But I found that your `conv2d_depthwise_kernel` uses a `$(Dtype) value` to accumulate the result, and the `test_modules` function tests with a FloatTensor; according to `utils.py`, the `$(Dtype) value` is replaced with a `float value` at run time, so I'm confused about why your code passes the test while mine doesn't.
I have also looked into methods for reducing the precision error, such as Kahan's summation formula; I implemented it inside the CUDA kernel, but it didn't help. Is accumulating the sum in double the only way, or the best way, to solve this problem?
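For reference, this is the compensated (Kahan) summation pattern I tried inside the kernel (a sketch, not my exact code); note that fused multiply-add contraction or other compiler reassociation may cancel out the compensation term, which might explain why it didn't help:

```cuda
// Kahan (compensated) summation of the products x[i]*y[i] in float.
// The compensation term c recovers the low-order bits lost in each add.
extern "C" __global__ void vector_dot_kahan(const float* x, const float* y,
                                            float* out, int n) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    float sum = 0.0f;
    float c = 0.0f;              // running compensation
    for (int i = 0; i < n; ++i) {
      float prod = x[i] * y[i];
      float t1 = prod - c;       // corrected term
      float t2 = sum + t1;       // new sum; low bits of t1 may be lost here
      c = (t2 - sum) - t1;       // recover what was lost
      sum = t2;
    }
    out[0] = sum;
  }
}
```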
Furthermore, I just tested the CUDA kernel using double to accumulate the sum, and it still failed on the length-512 vector; here is my test result screenshot