
[AMDGPU] Broken `sincos` intrinsic on GPUCompiler 0.24

Open pxl-th opened this issue 2 years ago • 4 comments

On GPUCompiler 0.23, the kernel outputs:

julia> main()
y = Float32[0.84147096, 0.5403023]

But on GPUCompiler 0.24 (same output for any 0.24.x version):

julia> main()
y = Float32[0.84147096, NaN]
ERROR: AssertionError: sum(y) == sum(sincos(1.0f0))

MWE:

using Adapt
using AMDGPU
using KernelAbstractions
import KernelAbstractions as KA

@kernel function sc1(y, x)
    i = @index(Global)
    @inbounds s, c = sincos(x[i])
    @inbounds y[1] = s
    @inbounds y[2] = c
end

function main()
    kab = ROCBackend()
    x = adapt(kab, ones(Float32, 1))
    y = KA.allocate(kab, Float32, 2)
    sc1(kab)(y, x; ndrange=1)
    @show y
    @assert sum(y) == sum(sincos(1f0))
    return
end
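
For reference, here is a host-only sketch of what the final assertion checks and why a single `NaN` makes it fail. It needs no GPU packages. The `broken` vector is a hand-written stand-in that mirrors the output reported above, not the result of a new run, and the variable names are purely illustrative.

```julia
# Host-only check: what should the kernel have written into `y`?
s, c = sincos(1.0f0)                 # reference values computed on the CPU

expected = Float32[s, c]             # what a correct kernel stores in y
broken   = Float32[s, NaN32]         # stand-in for the output seen on 0.24

# The MWE's assertion holds for the expected values...
@assert sum(expected) == sum(sincos(1.0f0))

# ...but any NaN poisons the sum, and NaN never compares equal to anything.
@assert isnan(sum(broken))
@assert !(sum(broken) == sum(sincos(1.0f0)))

# An element-wise check pinpoints which output is wrong (here: the cosine slot).
bad = findall(isnan, broken)
@assert bad == [2]
```

Only the second element (the cosine) is affected; the sine value matches the reference.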

I'm also attaching the `@device_code` output for both versions. The `.opt.ll` files start to differ at line 288.

pxl-th, Sep 25 '23 11:09

Which Julia version? If 1.10, I'd suspect the NewPM changes (the only change that stands out in https://github.com/JuliaGPU/GPUCompiler.jl/compare/v0.23.0...v0.24.5).

maleadt, Sep 26 '23 07:09

Yes, Julia 1.10

pxl-th, Sep 26 '23 07:09

Just to confirm: it was the NewPM change. Not sure how relevant this is, but it reproduces even with -O0.

gbaraldi, Sep 29 '23 00:09

Allocopt is causing this. We turn a

  %newstruct32 = call noalias nonnull dereferenceable(4) {} addrspace(10)* @julia.gc_alloc_obj({}** %current_task31, i64 4, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140091797818128 to {}*) to {} addrspace(10)*)) #0, !dbg !187
  %33 = addrspacecast {} addrspace(10)* %newstruct32 to {} addrspace(11)*, !dbg !196
  %34 = call nonnull {}* @julia.pointer_from_objref({} addrspace(11)* %33) #1, !dbg !196

into

  %3 = alloca i32, align 8, addrspace(5)
  %4 = bitcast i32 addrspace(5)* %3 to i8 addrspace(5)*
  %5 = bitcast i8 addrspace(5)* %4 to {} addrspace(5)*
  %newstruct32 = addrspacecast {} addrspace(5)* %5 to {} addrspace(10)*
...
  call void @llvm.lifetime.start.p5i8(i64 4, i8 addrspace(5)* %4)
  %36 = addrspacecast {} addrspace(10)* %newstruct32 to {} addrspace(11)*, !dbg !196
  %37 = call nonnull {}* @julia.pointer_from_objref({} addrspace(11)* %36) #1, !dbg !196

Those address spaces don't look valid.

gbaraldi, Sep 29 '23 17:09

This seems fine on 1.10.5. I think this got fixed in https://github.com/JuliaLang/julia/commit/af9a7af3b27c0ff22179a7dcd30bf6753d3d575f

gbaraldi, Sep 02 '24 19:09

That commit isn't in 1.10 though, so probably something in GPUCompiler changed too?

maleadt, Sep 02 '24 20:09

It is in 1.10, but I think it got a manual backport: https://github.com/JuliaLang/julia/commit/99c4ae46a610a62699a6a59f66d21723734b911f

gbaraldi, Sep 03 '24 17:09

Oh, great, thanks for confirming.

maleadt, Sep 03 '24 17:09