WIP: Use contextual dispatch for replacing functions
On Julia 1.1, Cassette should be performant enough for these kinds of transforms.
Fixes https://github.com/JuliaGPU/CUDAnative.jl/issues/27
@maleadt did you have a branch similar to this around?
bors try
Yes, https://github.com/JuliaGPU/CUDAnative.jl/compare/tb/cassette. It didn't work because of plenty of allocations, invokes, dispatches, etc. Is your approach different in that regard? Also, #265.
bors try
As bors tells us, apparently not ;)
@jrevels https://gitlab.com/JuliaGPU/CUDAnative.jl/-/jobs/153739960 is full of interesting cases.
bors try
Yeah, as I feared... Let's mark this WIP then :slightly_frowning_face:
bors try
Same error count; inlining doesn't help.
That said, many stack traces point to `getindex` again, so maybe there's only a small number of errors remaining. I'll have another go at reducing `vadd` when I have some time.
I was planning on grabbing Jarrett this week to see if we can figure it out. (I am in the process of adding GPU support to Cthulhu, so that should make it easier.)
bors try
Ok! The debugging session with Jarrett proved fruitful, we are down to 10ish failures :)
Cool! What were the changes?
We applied my usual Cassette issue workaround of "isolate the problematic thing and make it a contextual primitive (i.e. don't overdub into it)". The problematic thing here was the `@pure` function `datatype_align`.
It turns out that while Cassette propagates purity to the compiler correctly, the compiler is (probably rightfully) pessimistic and just bails out of purity optimization for generated functions (i.e. `overdub`). Ref https://github.com/JuliaLang/julia/pull/31012, which is my naive attempt at changing the compiler to allow this sort of thing. If that lands, we can remove the extra contextual primitive definition here.
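Concretely, the workaround looks roughly like the following (a sketch using the standard Cassette pattern; `CUDACtx` is an illustrative context name, not necessarily the one in this PR):

```julia
using Cassette

Cassette.@context CUDACtx   # illustrative name for the CUDAnative context

# Turn `Base.datatype_align` into a contextual primitive: a specific `overdub`
# method means Cassette calls it directly instead of recursing into its
# (@pure) implementation through a generated function.
Cassette.overdub(::CUDACtx, ::typeof(Base.datatype_align), T) =
    Base.datatype_align(T)
```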
bors try
I think we are down to two Cassette-related issues, while the rest is adjusting tests / the extra level of indirection messing things up.
```julia
julia> function kernel1(T, i)
           sink(i)
           return
       end
kernel1 (generic function with 1 method)

julia> @cuda kernel1(Int, 1)
ERROR: InvalidIRError: compiling #103(Type{Int64}, Int64) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
Stacktrace:
[1] #103 at /home/tbesard/Julia/CUDAnative/src/context.jl:51
Reason: unsupported call to the Julia runtime (call to jl_f_getfield)
Stacktrace:
[1] #103 at /home/tbesard/Julia/CUDAnative/src/context.jl:51
```
```julia
julia> inner_kwargf(foobar; foo=1, bar=2) = nothing
inner_kwargf (generic function with 1 method)

julia> @cuda (()->inner_kwargf(42; foo=1, bar=2))()
ERROR: GPU compilation of #103() failed
KernelError: kernel returns a value of type `Any`
```
Some more obscure errors as well, but these are the obvious codegen-related ones.

bors try
I really dislike the loss of method redefinition support though, so either we need a proper fix or a hack (like emptying the CUDAnative compile cache upon every REPL execution -- but we don't have a useful REPL API for that) to support redefinitions.
EDIT: even emptying the compile cache isn't sufficient, there's other caching going on
```julia
# valid def
julia> foo() = nothing
julia> @cuda foo()

# invalid def
julia> foo() = 1
julia> @cuda foo()
# works, too bad

# I expected this to fail
julia> empty!(CUDAnative.compilecache); @cuda foo()

# to show the def is really invalid
julia> bar() = 1
julia> @cuda bar()
ERROR: GPU compilation of #103() failed
KernelError: kernel returns a value of type `Int64`
```
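For the "proper fix" side, one direction could be to make the cache itself redefinition-aware. A minimal sketch, assuming a simple dictionary cache and a placeholder `compile_kernel` (neither is CUDAnative's actual code): key entries on the current world age, which Julia bumps on every method (re)definition, so a redefined kernel never hits a stale entry.

```julia
# Sketch: a redefinition-aware compile cache keyed on the current world age.
compile_kernel(f, tt) = (f, tt)  # placeholder for the real GPU compilation step

const kernel_cache = Dict{Tuple{Any,Any,UInt},Any}()

function cached_kernel(f, tt)
    world = Base.get_world_counter()  # bumped whenever any method is (re)defined
    get!(kernel_cache, (f, tt, world)) do
        compile_kernel(f, tt)
    end
end
```

This is overly conservative (any new definition anywhere produces a cache miss) and it wouldn't touch the other caching layers mentioned above, but it sketches the idea.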
bors try
I agree that losing the ability to redefine is annoying.
Regarding:
```julia
julia> function kernel1(T, i)
           sink(i)
           return
       end
kernel1 (generic function with 1 method)

julia> @cuda kernel1(Int, 1)
ERROR: InvalidIRError: compiling #103(Type{Int64}, Int64) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
Stacktrace:
[1] #103 at /home/tbesard/Julia/CUDAnative/src/context.jl:51
Reason: unsupported call to the Julia runtime (call to jl_f_getfield)
Stacktrace:
[1] #103 at /home/tbesard/Julia/CUDAnative/src/context.jl:51
```
The issue is that Cassette emits a call to `overdub(cudactx, Main.sink, i)`, which is what causes the `jl_f_tuple` to appear. Not sure how to fix this.
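For context, here is a minimal self-contained sketch of what that transformation boils down to (`DemoCtx` and the placeholder `sink` are illustrative, and this is a simplification of what Cassette actually generates):

```julia
using Cassette

Cassette.@context DemoCtx   # stand-in for CUDAnative's context type

sink(x) = x                 # placeholder for the thread's `sink`

# Inside the overdubbed kernel, the nested call `sink(i)` effectively becomes a
# contextual call, with the context and callee threaded through as extra
# arguments. If that call isn't fully specialized, the argument tuple gets
# built at runtime, which is where jl_f_tuple / jl_f_getfield come from.
function overdubbed_kernel1(ctx, T, i)
    Cassette.overdub(ctx, sink, i)
    return
end

overdubbed_kernel1(DemoCtx(), Int, 1)
```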
```
argument count: Error During Test at /builds/JuliaGPU/CUDAnative.jl/test/device/execution.jl:440
Got exception outside of a @test
InvalidIRError: compiling #103(Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f__apply)
Stacktrace:
[1] #103 at /builds/JuliaGPU/CUDAnative.jl/src/context.jl:56
Stacktrace:
[1] check_ir(::CUDAnative.Com
```
Looks like the tuple limit.
bors try
Ok that reduces it down to:
- #265 for Cassette, https://github.com/jrevels/Cassette.jl/issues/6
- https://github.com/JuliaGPU/CUDAnative.jl/pull/334#issuecomment-466184514
- Traces now have `overdub` in them, would be lovely to filter those out (see the sketch below)
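On the last point, here is a rough sketch of the kind of trace filtering that could help (purely illustrative, not an existing CUDAnative or Cassette feature): drop stack frames whose function is Cassette's `overdub` before printing.

```julia
# Sketch: hide Cassette's `overdub` frames from a backtrace before printing it.
filter_overdub(trace::Vector{Base.StackTraces.StackFrame}) =
    filter(frame -> frame.func !== :overdub, trace)

# Host-side example: print a cleaned-up trace.
for frame in filter_overdub(stacktrace())
    println(frame)
end
```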