Add a hostcall interface
Fixes https://github.com/JuliaGPU/CUDA.jl/issues/440
Initial, simple implementation. I still need to steal ideas from AMDGPU.jl and optimizations from https://github.com/JuliaGPU/CUDA.jl/pull/567, but the initial goal is a simple yet correct implementation that we can use for unlikely code paths such as error reporting.
Demo:
```julia
julia> using CUDA

julia> function test(x)
           println("This is a hostcall from thread $x")
           x+1
       end
test (generic function with 1 method)

julia> function kernel()
           rv = hostcall(test, Int, Tuple{Int}, threadIdx().x)
           @cuprintln("Hostcall returned $rv")
           return
       end
kernel (generic function with 1 method)

julia> @cuda threads=2 kernel();
This is a hostcall from thread 1
Hostcall returned 2
This is a hostcall from thread 2
Hostcall returned 3
```
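Conceptually, a hostcall like this boils down to a request/response handshake through a shared buffer that a host-side watcher task polls. Below is a self-contained toy of that handshake, modelled with two CPU threads instead of GPU and CPU; all names here are hypothetical, this is not the PR's actual code.

```julia
# Toy model of the hostcall handshake (hypothetical sketch, not the PR's code).
# Run with `julia -t 2`: with a single thread the busy-wait below deadlocks,
# which is exactly the scheduling problem discussed further down.
struct ToyHostcall
    state::Threads.Atomic{Int}   # 0 = idle, 1 = request posted, 2 = response ready
    arg::Base.RefValue{Int}
    ret::Base.RefValue{Int}
end
ToyHostcall() = ToyHostcall(Threads.Atomic{Int}(0), Ref(0), Ref(0))

# Host side: poll for requests and execute them on the CPU.
function watcher(hc::ToyHostcall, f)
    while true
        if hc.state[] == 1
            hc.ret[] = f(hc.arg[])   # run the requested call on the host
            hc.state[] = 2           # publish the response
        end
        yield()                      # stay cooperative with other tasks
    end
end

# "Device" side: post a request and spin until the response arrives.
function devicecall(hc::ToyHostcall, x)
    hc.arg[] = x
    hc.state[] = 1                   # post the request
    while hc.state[] != 2 end        # a kernel thread can only busy-wait
    ret = hc.ret[]
    hc.state[] = 0                   # release the slot
    return ret
end

hc = ToyHostcall()
Threads.@spawn watcher(hc, x -> x + 1)
@show devicecall(hc, 41)             # devicecall(hc, 41) = 42
```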
Depends on https://github.com/JuliaGPU/CUDA.jl/pull/1110.
Probably requires Base support like https://github.com/JuliaLang/julia/pull/42302
cc @jpsamaroo
Codecov Report
Merging #1140 (35026b9) into master (5b74388) will increase coverage by 8.97%. The diff coverage is 86.07%.

:exclamation: Current head 35026b9 differs from pull request most recent head 1fe2b4c. Consider uploading reports for the commit 1fe2b4c to get more accurate results.
```diff
@@            Coverage Diff             @@
##           master    #1140      +/-   ##
==========================================
+ Coverage   66.97%   75.94%    +8.97%
==========================================
  Files         118      119        +1
  Lines        7955     7737      -218
==========================================
+ Hits         5328     5876      +548
+ Misses       2627     1861      -766
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| lib/cudadrv/types.jl | 83.33% <0.00%> (-16.67%) | :arrow_down: |
| src/CUDA.jl | 100.00% <ø> (ø) | |
| src/compiler/hostcall.jl | 85.48% <85.48%> (ø) | |
| src/compiler/execution.jl | 84.61% <85.71%> (+0.54%) | :arrow_up: |
| lib/cudadrv/execution.jl | 100.00% <100.00%> (+3.44%) | :arrow_up: |
| src/compiler/exceptions.jl | 64.28% <100.00%> (-29.84%) | :arrow_down: |
| src/compiler/gpucompiler.jl | 82.14% <100.00%> (-1.73%) | :arrow_down: |
| examples/wmma/low-level.jl | 0.00% <0.00%> (-100.00%) | :arrow_down: |
| examples/wmma/high-level.jl | 0.00% <0.00%> (-100.00%) | :arrow_down: |
| src/linalg.jl | 36.36% <0.00%> (-50.01%) | :arrow_down: |
| ... and 72 more | | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Hmm, one problem is that the following deadlocks:
```julia
# hostcall watcher task/thread
Threads.@spawn begin
    while true
        println(1)
        sleep(1)
    end
end

# the application, possibly getting stuck in a CUDA API call that needs the kernel to finish
while true
    ccall(:sleep, Cuint, (Cuint,), 1)
end
```
I had expected this when running with a single thread, because the main task isn't preemptible, but even with multiple threads the main task getting stuck apparently blocks the scheduler, keeping the hostcall watcher thread from making progress. That would cause a deadlock. @vchuravy, any thoughts? How does AMDGPU.jl solve this?
And some preliminary time measurements:
```julia
julia> kernel() = hostcall(identity, Nothing, Tuple{Nothing}, nothing)

julia> @benchmark CUDA.@sync @cuda threads=1024 blocks=10 kernel()
BenchmarkTools.Trial: 79 samples with 1 evaluation.
 Range (min … max):  23.918 ms … 103.041 ms  ┊ GC (min … max): 0.00% … 2.35%
 Time  (median):     82.768 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   64.525 ms ± 31.968 ms   ┊ GC (mean ± σ):  0.74% ± 1.92%

  █▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▂▄▁▆▅▂▁▅▁▁▂▂▄ ▁
  23.9 ms         Histogram: frequency by time          101 ms <
```
So about 2.25 µs 'per' hostcall (uncontended, and nonblocking since the call doesn't return anything). That's not great, but it's a start. I also don't want to build on this before I'm sure it won't deadlock applications.
And for reference, `@cuprint` and `malloc` (two calls that could be replaced by hostcall-based alternatives) are both an order of magnitude faster. That's somewhat expected, as neither actually needs to communicate with the CPU: `printf` uses a ring buffer and is happy to trample over unprocessed output, while `malloc` uses a fixed-size, preallocated buffer as the source for a bump allocator, as sketched below. Still, in the uncontended case (which essentially behaves like a ring buffer as well) we should be able to do much better.
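To make the `malloc` comparison concrete, here's a minimal sketch of the bump-allocator idea in plain Julia, for illustration only (the real device-side allocator operates on GPU memory; `POOL` and `bump_alloc` are made-up names):

```julia
# Minimal bump allocator over a fixed-size, preallocated buffer.
# No per-allocation bookkeeping and no free(): just an atomic offset bump,
# which is why it needs no round trip to the CPU.
const POOL   = Vector{UInt8}(undef, 1 << 20)   # preallocated backing buffer
const OFFSET = Threads.Atomic{Int}(0)          # next free byte in POOL

function bump_alloc(sz::Int)
    off = Threads.atomic_add!(OFFSET, sz)      # claim `sz` bytes; returns old offset
    off + sz <= length(POOL) || return Ptr{UInt8}(0)  # pool exhausted
    return pointer(POOL) + off
end
```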
> I had expected this when running with a single thread, because the main task isn't preemptible, but even with multiple threads the main task getting stuck apparently blocks the scheduler, keeping the hostcall watcher thread from making progress. That would cause a deadlock.
Are you sure you are blocking the scheduler, or are you blocking the GC? You need at least a safepoint in the loop.
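For reference, a minimal sketch of what such a safepoint could look like in the second loop (this only helps if it's the GC, rather than the task scheduler, that is being blocked):

```julia
while true
    GC.safepoint()                      # let a GC requested by another thread proceed
    ccall(:sleep, Cuint, (Cuint,), 1)   # still blocks this OS thread for a second
end
```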
> You need at least a safepoint in the loop
In which loop? The first does a `sleep`, so that's a yield point. The second doesn't need to be a loop; it could just as well be an API call that blocks 'indefinitely'.
Seems to deadlock regularly on CI, so I guess this will have to wait until we either have application threads or a way to make CUDA's blocking API calls yield.
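For illustration, Base's `@threadcall` is one existing way to make a blocking `ccall` yield: it runs the call on a libuv threadpool thread while the calling task suspends. A sketch against the repro above, with the caveat that `@threadcall` has real overhead and the called C code must not call back into Julia:

```julia
# The blocking application loop from the repro, but yielding to the scheduler:
# the C call runs on a libuv worker thread while this Julia task is suspended,
# so the hostcall watcher can keep making progress.
while true
    @threadcall(:sleep, Cuint, (Cuint,), 1)
end
```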