GPUCompiler.jl New deferred_codegen implementation

Draft implementation for https://github.com/JuliaGPU/GPUCompiler.jl/issues/581

@aviatesk pointed me to:

https://github.com/JuliaLang/julia/blob/b9f68ac0afe7ad896fe2803b139b9c32103ac417/base/compiler/ssair/inlining.jl#L580
https://github.com/JuliaLang/julia/blob/d61db204e576fa53bd897b2e6b1298ad9ad39f40/base/compiler/ssair/inlining.jl#L995-L996

These are the relevant spots for the refinement, so we will need a small pass that rewrites the call after abstract interpretation.
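For comparison, stock Julia already refines ordinary calls in its optimizer: a call that is not inlined is typically resolved to an `:invoke` of a specific method instance. The sketch below uses only Base Julia (no GPUCompiler) to show that difference between unoptimized and optimized typed IR. `has_head` is a helper made up for this illustration, and the exact statement layout may differ between Julia versions.

```julia
# Plain Base Julia; `child` and `kernel` here call each other directly,
# without any deferred-codegen marker.
@noinline child(i) = i
kernel(i) = child(i)

# code_typed returns a vector of `CodeInfo => return type` pairs.
unopt = first(only(code_typed(kernel, (Int,); optimize=false)))
opt   = first(only(code_typed(kernel, (Int,); optimize=true)))

# Illustrative helper: does any statement have the given Expr head?
has_head(ci, head) = any(stmt -> Meta.isexpr(stmt, head), ci.code)

println("unoptimized contains :call   -> ", has_head(unopt, :call))
println("optimized contains :invoke -> ", has_head(opt, :invoke))
```

A pass for the deferred marker would hook in at a similar point, after types are known, but rewrite the marker call instead of producing an `:invoke`.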

May 13 '24 15:05 vchuravy

Codecov Report

Attention: Patch coverage is 0% with 139 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (d68a7fc) to head (927fc30).

Files	Patch %	Lines
src/jlgen.jl	0.00%	77 Missing :warning:
src/driver.jl	0.00%	57 Missing :warning:
src/irgen.jl	0.00%	5 Missing :warning:

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #582    +/-   ##
=======================================
  Coverage    0.00%   0.00%            
=======================================
  Files          24      24            
  Lines        3064    3190   +126     
=======================================
- Misses       3064    3190   +126


May 13 '24 19:05 codecov[bot]

Okay, this looks like the right direction.

Using:

@noinline child(i) = i
kernel(i) = GPUCompiler.var"gpuc.deferred"(child, i)

This gets refined from

GPUCompiler.code_typed(job, optimize=false)
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = GPUCompiler.:(var"gpuc.deferred")::Core.Const(GPUCompiler.var"gpuc.deferred")
│   %2 = Main.child::Core.Const(child)
│   %3 = (%1)(%2, i)::Ptr{Nothing}
└──      return %3
) => Ptr{Nothing}

to

GPUCompiler.code_typed(job, optimize=true)
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = (GPUCompiler.var"gpuc.lookup")(MethodInstance for child(::Int64), Main.child, i)::Ptr{Nothing}
└──      return %1
) => Ptr{Nothing}

but codegen doesn't like what we are doing and generates a generic julia.call (through ijl_apply_generic), which boxes the argument and unboxes the result.

;  @ /home/vchuravy/src/GPUCompiler/deferred.jl:13 within `kernel`
define i64 @julia_kernel_4280(i64 signext %0) local_unnamed_addr {
top:
  %1 = call {}*** @julia.get_pgcstack()
  %2 = bitcast {}*** %1 to {}**
  %current_task = getelementptr inbounds {}*, {}** %2, i64 -14
  %3 = bitcast {}** %current_task to i64*
  %world_age = getelementptr inbounds i64, i64* %3, i64 15
  %4 = call fastcc nonnull {}* @ijl_box_int64(i64 signext %0)
  %5 = call nonnull {}* ({}* ({}*, {}**, i32)*, {}*, ...) @julia.call({}* ({}*, {}**, i32)* @ijl_apply_generic, {}* inttoptr (i64 125916715977696 to {}*), {}* inttoptr (i64 125918231668992 to {}*), {}* inttoptr (i64 125916747896912 to {}*), {}* %4)
  %6 = bitcast {}* %5 to i64*
  %unbox = load i64, i64* %6, align 8
  ret i64 %unbox
}

May 15 '24 01:05 vchuravy

Okay, much nicer: instead of refining to a Julia function, we go straight to an llvmcall.

;  @ /home/vchuravy/src/GPUCompiler/deferred.jl:13 within `kernel`
define i64 @julia_kernel_451(i64 signext %0) local_unnamed_addr {
top:
  %1 = call {}*** @julia.get_pgcstack()
  %2 = bitcast {}*** %1 to {}**
  %current_task = getelementptr inbounds {}*, {}** %2, i64 -14
  %3 = bitcast {}** %current_task to i64*
  %world_age = getelementptr inbounds i64, i64* %3, i64 15
  %4 = call i64 @gpuc.lookup({}* inttoptr (i64 132498188785376 to {}*), {}* inttoptr (i64 132498210393720 to {}*), i64 %0)
  ret i64 %4
}

@maleadt An added benefit is that this should handle invalidations of child correctly xD

May 15 '24 01:05 vchuravy

@maleadt I left the old implementation alive since Enzyme is using it.

We could add a token to declare who owns it...

The addition of AbstractGPUCompiler is so that Enzyme can inherit the implementation here, and maybe customize it. But I haven't thought that interaction fully through.

Potentially Enzyme ought to have an enz.lookup function instead, but then we need to somehow make the processing here extensible.

Jun 28 '24 18:06 vchuravy

But this now successfully turns

; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-linux-gnu"

;  @ /home/vchuravy/src/GPUCompiler/deferred.jl:13 within `kernel`
define i64 @julia_kernel_400(i64 signext %0) local_unnamed_addr {
top:
  %1 = call i64 @gpuc.lookup({}* inttoptr (i64 139647662117264 to {}*), {}* inttoptr (i64 139646283783472 to {}*), i64 %0)
  ret i64 %1
}

declare i64 @gpuc.lookup({}*, {}*, i64) local_unnamed_addr

!llvm.module.flags = !{!0, !1}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}

Into

; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-linux-gnu"

;  @ /home/vchuravy/src/GPUCompiler/deferred.jl:13 within `kernel`
define i64 @julia_kernel_400(i64 signext %0) local_unnamed_addr {
top:
  ret i64 ptrtoint (i64 (i64)* @julia_child_465 to i64)
}

;  @ /home/vchuravy/src/GPUCompiler/deferred.jl:12 within `child`
; Function Attrs: noinline
define i64 @julia_child_465(i64 signext %0) local_unnamed_addr #0 {
top:
  ret i64 %0
}

attributes #0 = { noinline }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}

From:

@noinline child(i) = i
kernel(i) = GPUCompiler.var"gpuc.deferred"(child, i)
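As a CPU-side analogy only (this is not the GPUCompiler mechanism), the same idea of "a function's address carried around as a plain integer" can be reproduced in Base Julia with `@cfunction` and `ccall`. The variable names are illustrative.

```julia
child(i::Int) = i

# C-callable pointer to a compiled specialization of `child`.
fptr = @cfunction(child, Int, (Int,))

# The address as a plain integer, comparable to the `ptrtoint` result above.
addr = UInt(fptr)

# Turn the integer back into a pointer and call through it.
p = Ptr{Cvoid}(addr)
result = ccall(p, Int, (Int,), 42)
println(result)
```

On the GPU side the caller would consume the returned integer in a comparable way, as the address of the already compiled child.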

Jun 28 '24 18:06 vchuravy

> @maleadt I left the old implementation alive since Enzyme is using it.

Why does Enzyme always require horrible things... Can't this just be a breaking release?

Jul 04 '24 13:07 maleadt

> Why does Enzyme always require horrible things... Can't this just be a breaking release?

I wonder that myself... Yeah we can make this a breaking release, but I will need some more time to figure out how the Enzyme part will work.

Jul 04 '24 16:07 vchuravy

@maleadt The specfunc name is "julia_##child#234_3583", but the name in the IR is "julia___child_237_18493".

Any idea where that sanitization is happening?

Jul 16 '24 19:07 vchuravy

Ah it's probably https://github.com/JuliaGPU/GPUCompiler.jl/blob/3c80a5d58131cea618a24a5c63b7e1f86b129297/src/irgen.jl#L52
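The two names are consistent with a rule that maps every non-alphanumeric character to an underscore. A minimal sketch of such a rule follows; `sanitize_name` is a hypothetical helper written for this illustration, and whether the linked line does exactly this is an assumption.

```julia
# Hypothetical helper: replace every character outside [A-Za-z0-9] with '_'.
sanitize_name(name::AbstractString) = replace(name, r"[^A-Za-z0-9]" => "_")

println(sanitize_name("julia_##child#234_3583"))  # julia___child_234_3583
```

The numeric suffixes in the two quoted names differ, so they come from different compilations; only the `##` and `#` handling is being compared here.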

Jul 16 '24 19:07 vchuravy

#636
#634
#633
#582 👈
master

This stack of pull requests is managed by Graphite.

Sep 26 '24 07:09 vchuravy