
Allow users to disable fp contract optimizations?

Open mwarusz opened this issue 5 years ago • 1 comment

Recently I ran into a surprising (to me) behavior, demonstrated by the MWE below:

using CuArrays, CUDAnative, GPUifyLoops

function kernel(rho, T)
  P = rho[1] * T[1]

  # P - P should be exactly zero, but with fp contraction it is not
  if abs(P - P) > 1e-16
    @cuprintf("diff = %.16e\n", P - P)
  end
  nothing
end

rho = CuArray([1e-1])
T = CuArray([300.0])
@launch CUDA() kernel(rho, T, threads=1, blocks=1)

with the output

diff = 1.6653345369377348e-15

Basically, if my understanding of the generated PTX is correct, P - P is calculated as fma(rho[1], T[1], -P), which is probably not the smartest move by the compiler. However, clang with LLVM 6.0.1 does the same for CUDA C, so I guess that's expected. The issue goes away if I disable contraction; clang has an option for that, -ffp-contract. Maybe adding a similar option to GPUifyLoops would be helpful for debugging?
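For what it's worth, the printed value can be reproduced on the CPU with an explicit fma, which computes the same thing as the contracted PTX. A minimal sketch, not from the original report:

# fma(rho, T, -P) adds -P to the *exact* product rho*T, so it yields
# the rounding error of P = rho * T rather than zero.
rho, T = 1e-1, 300.0
P = rho * T
println(fma(rho, T, -P))  # prints 1.6653345369377348e-15, matching the kernel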

For convenience, the generated PTX can be found here:

https://gist.github.com/mwarusz/5ab4ac99b02e77b54178cd95c9820d7b

mwarusz · Jul 10 '19

Thanks for bringing this up. The goal in #55 was indeed to match Clang (we were hunting down a performance gap).

I agree that using contract unconditionally is probably not what we want in the long term. Julia in general tries to provide localised control to the user (compare @fastmath).
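Base Julia already illustrates that opt-in, locally scoped style; a sketch of the existing idioms (none of this is a GPUifyLoops API):

f(a, b, c) = a * b + c            # default: strict IEEE, never contracted
g(a, b, c) = muladd(a, b, c)      # opt-in: the compiler may fuse into an fma
h(a, b, c) = @fastmath a * b + c  # opt-in: fast-math flags, scoped to this expression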

But yeah:

	mul.f64 	%fd4, %fd1, %fd3;
	neg.f64 	%fd5, %fd4;
	fma.rn.f64 	%fd2, %fd1, %fd3, %fd5;
	abs.f64 	%fd6, %fd2;

is kinda funny; the only explanation I have is that the fma units are the fastest thing.
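(An aside not raised in the thread: mul followed by fma(a, b, -p) is the classic "TwoProd" error-free transformation, so the kernel ends up printing exactly the rounding error of rho[1] * T[1]. A sketch:)

function two_prod(a, b)
    p = a * b           # rounded product
    e = fma(a, b, -p)   # recovers the rounding error: a*b == p + e exactly
    return p, e
end
two_prod(1e-1, 300.0)   # (30.0, 1.6653345369377348e-15)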

vchuravy · Jul 11 '19