New deferred_codegen implementation
Draft implementation for https://github.com/JuliaGPU/GPUCompiler.jl/issues/581
@aviatesk pointed me to:
- https://github.com/JuliaLang/julia/blob/b9f68ac0afe7ad896fe2803b139b9c32103ac417/base/compiler/ssair/inlining.jl#L580
- https://github.com/JuliaLang/julia/blob/d61db204e576fa53bd897b2e6b1298ad9ad39f40/base/compiler/ssair/inlining.jl#L995-L996
These are the relevant spots for the refinement, so we will need a small pass that rewrites the call after abstract interpretation.
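As a rough illustration of what such a rewriting pass does, here is a toy sketch that operates on plain `Expr` statements rather than on the compiler's real IR. Everything in it is hypothetical: `deferred` and `lookup` are stand-ins for `gpuc.deferred` and `gpuc.lookup`, `rewrite_deferred` is an invented name, and a `(function, argument types)` tuple stands in for the `MethodInstance`. It is not the code in this PR.

```julia
# Hypothetical stand-ins for the two intrinsics; only the roles mirror the PR.
deferred(f, args...) = error("placeholder: must be rewritten before execution")
lookup(key, f, args...) = error("placeholder: resolved later by the compiler")

# Rewrite one statement: `deferred(f, args...)` becomes `lookup(key, f, args...)`.
# `key` is a toy substitute for the MethodInstance of the callee.
function rewrite_stmt(stmt, argtypes)
    if stmt isa Expr && stmt.head === :call && stmt.args[1] === deferred
        f = stmt.args[2]
        args = stmt.args[3:end]
        key = (f, argtypes)
        return Expr(:call, lookup, key, f, args...)
    end
    return stmt  # every other statement is left untouched
end

# Apply the rewrite to a whole statement list, as a post-inference pass would.
rewrite_deferred(stmts, argtypes) = Any[rewrite_stmt(s, argtypes) for s in stmts]

# Usage: a two-statement body resembling `kernel(i) = deferred(child, i)`.
child(i) = i
stmts = Any[Expr(:call, deferred, child, :i), Expr(:return, Core.SSAValue(1))]
rewritten = rewrite_deferred(stmts, Tuple{Int})
```

The real pass has to run on typed IR after abstract interpretation, because only then are the argument types known that identify the specialization of the callee.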
Okay, this looks like the right direction.
Using:
@noinline child(i) = i
kernel(i) = GPUCompiler.var"gpuc.deferred"(child, i)
This gets refined from
GPUCompiler.code_typed(job, optimize=false)
1-element Vector{Any}:
CodeInfo(
1 ─ %1 = GPUCompiler.:(var"gpuc.deferred")::Core.Const(GPUCompiler.var"gpuc.deferred")
│ %2 = Main.child::Core.Const(child)
│ %3 = (%1)(%2, i)::Ptr{Nothing}
└── return %3
) => Ptr{Nothing}
to
GPUCompiler.code_typed(job, optimize=true)
1-element Vector{Any}:
CodeInfo(
1 ─ %1 = (GPUCompiler.var"gpuc.lookup")(MethodInstance for child(::Int64), Main.child, i)::Ptr{Nothing}
└── return %1
) => Ptr{Nothing}
but codegen doesn't like what we are doing and generates a julia.call, which involves boxing.
; @ /home/vchuravy/src/GPUCompiler/deferred.jl:13 within `kernel`
define i64 @julia_kernel_4280(i64 signext %0) local_unnamed_addr {
top:
%1 = call {}*** @julia.get_pgcstack()
%2 = bitcast {}*** %1 to {}**
%current_task = getelementptr inbounds {}*, {}** %2, i64 -14
%3 = bitcast {}** %current_task to i64*
%world_age = getelementptr inbounds i64, i64* %3, i64 15
%4 = call fastcc nonnull {}* @ijl_box_int64(i64 signext %0)
%5 = call nonnull {}* ({}* ({}*, {}**, i32)*, {}*, ...) @julia.call({}* ({}*, {}**, i32)* @ijl_apply_generic, {}* inttoptr (i64 125916715977696 to {}*), {}* inttoptr (i64 125918231668992 to {}*), {}* inttoptr (i64 125916747896912 to {}*), {}* %4)
%6 = bitcast {}* %5 to i64*
%unbox = load i64, i64* %6, align 8
ret i64 %unbox
}
Okay, much nicer: instead of refining to a Julia function, we go straight to an llvmcall
; @ /home/vchuravy/src/GPUCompiler/deferred.jl:13 within `kernel`
define i64 @julia_kernel_451(i64 signext %0) local_unnamed_addr {
top:
%1 = call {}*** @julia.get_pgcstack()
%2 = bitcast {}*** %1 to {}**
%current_task = getelementptr inbounds {}*, {}** %2, i64 -14
%3 = bitcast {}** %current_task to i64*
%world_age = getelementptr inbounds i64, i64* %3, i64 15
%4 = call i64 @gpuc.lookup({}* inttoptr (i64 132498188785376 to {}*), {}* inttoptr (i64 132498210393720 to {}*), i64 %0)
ret i64 %4
}
@maleadt added benefit is that this should handle invalidations of child correctly xD
@maleadt I left the old implementation alive since Enzyme is using it.
We could add a token to declare who owns it...
The addition of AbstractGPUCompiler is so that Enzyme can inherit the implementation here, and maybe customize it.
But I haven't fully thought that interaction through.
Potentially Enzyme ought to have an enz.lookup function instead, but then we need to somehow make the processing here extensible.
But this now successfully turns
; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-linux-gnu"
; @ /home/vchuravy/src/GPUCompiler/deferred.jl:13 within `kernel`
define i64 @julia_kernel_400(i64 signext %0) local_unnamed_addr {
top:
%1 = call i64 @gpuc.lookup({}* inttoptr (i64 139647662117264 to {}*), {}* inttoptr (i64 139646283783472 to {}*), i64 %0)
ret i64 %1
}
declare i64 @gpuc.lookup({}*, {}*, i64) local_unnamed_addr
!llvm.module.flags = !{!0, !1}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
Into
; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-linux-gnu"
; @ /home/vchuravy/src/GPUCompiler/deferred.jl:13 within `kernel`
define i64 @julia_kernel_400(i64 signext %0) local_unnamed_addr {
top:
ret i64 ptrtoint (i64 (i64)* @julia_child_465 to i64)
}
; @ /home/vchuravy/src/GPUCompiler/deferred.jl:12 within `child`
; Function Attrs: noinline
define i64 @julia_child_465(i64 signext %0) local_unnamed_addr #0 {
top:
ret i64 %0
}
attributes #0 = { noinline }
!llvm.module.flags = !{!0, !1}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
From:
@noinline child(i) = i
kernel(i) = GPUCompiler.var"gpuc.deferred"(child, i)
> @maleadt I left the old implementation alive since Enzyme is using it.

Why does Enzyme always require horrible things... Can't this just be a breaking release?

> Why does Enzyme always require horrible things... Can't this just be a breaking release?

I wonder that myself... Yeah, we can make this a breaking release, but I will need some more time to figure out how the Enzyme part will work.
@maleadt The specfunc name is "julia_##child#234_3583"
But the name in the IR is "julia___child_237_18493"
Any idea where that sanitization is happening?
Ah it's probably https://github.com/JuliaGPU/GPUCompiler.jl/blob/3c80a5d58131cea618a24a5c63b7e1f86b129297/src/irgen.jl#L52
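For reference, the effect seen above is consistent with a sanitizer that replaces every character outside `[A-Za-z0-9_]` with an underscore. The sketch below assumes that rule; `sanitize_name` is a hypothetical helper name, and the exact regex used in `irgen.jl` may differ.

```julia
# Hypothetical helper: replace every character that is not alphanumeric
# or an underscore with "_" (assumed rule, not copied from irgen.jl).
sanitize_name(name::AbstractString) = replace(name, r"[^A-Za-z0-9_]" => "_")

specfunc = "julia_##child#234_3583"
sanitized = sanitize_name(specfunc)
println(sanitized)  # prints julia___child_234_3583
```

This reproduces the `julia___child_…` shape of the name in the IR; the numeric suffixes in the two names quoted above differ, so they do not appear to be the same symbol before and after sanitization.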