GPUifyLoops.jl
Allow users to disable fp contract optimizations?
Recently I ran into a surprising (for me) behavior, demonstrated by the MWE below:
using CuArrays, CUDAnative, GPUifyLoops
function kernel(rho, T)
P = rho[1] * T[1]
if (abs(P - P) > 1e-16)
@cuprintf("diff = %.16e\n", P - P)
end
nothing
end
rho = CuArray([1e-1])
T = CuArray([300.0])
@launch CUDA() kernel(rho, T, threads=1, blocks=1)
with the output:
diff = 1.6653345369377348e-15
Basically, if my understanding of the generated PTX is correct, what happens is that P - P
is calculated as fma(rho[1], T[1], -P),
which is probably not the smartest move by the compiler. However, clang with LLVM 6.0.1 also does this for CUDA C, so I guess it's expected. The issue goes away if I disable contraction; in clang there's an option for that called -ffp-contract.
Maybe adding a similar option to GPUifyLoops would be helpful for debugging?
For convenience, the generated PTX can be found here:
https://gist.github.com/mwarusz/5ab4ac99b02e77b54178cd95c9820d7b
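For anyone puzzled about where the nonzero difference comes from: once P - P is contracted into fma(rho[1], T[1], -P), the fma computes rho[1] * T[1] exactly before subtracting the already-rounded product P, so the result is exactly the rounding error of the multiplication. This can be reproduced on the CPU; here is a sketch in Python that emulates a correctly rounded fma with exact rational arithmetic (the `fma` helper is ours, not part of the kernel):

```python
from fractions import Fraction

def fma(a, b, c):
    """Emulate a correctly rounded fused multiply-add:
    compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # Fraction -> float rounds correctly

rho, T = 1e-1, 300.0
P = rho * T              # product rounded to the nearest double (30.0)
diff = fma(rho, T, -P)   # what the contracted `P - P` becomes
print(P - P)             # 0.0 without contraction
print(diff)              # 1.6653345369377348e-15, matching the kernel output
```

Since 1e-1 is not exactly representable in binary, rho * T carries a rounding error of about 1.67e-15, and the contracted expression reports precisely that error instead of zero.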
Thanks for bringing this up, the goal in #55 was indeed to match Clang (we were hunting down a performance gap).
I agree that the fact that we use contract unconditionally is probably not what we want in the long term. Julia in general tries to provide localised control to the user (compare @fastmath).
But yeah:
mul.f64 %fd4, %fd1, %fd3;
neg.f64 %fd5, %fd4;
fma.rn.f64 %fd2, %fd1, %fd3, %fd5;
abs.f64 %fd6, %fd2;
is kinda funny; the only explanation I have is that the fma units are the fastest thing available.