
Add a hostcall interface

Open · maleadt opened this pull request · 6 comments

Fixes https://github.com/JuliaGPU/CUDA.jl/issues/440

Initial, simple implementation. I still need to steal ideas from AMDGPU.jl and optimizations from https://github.com/JuliaGPU/CUDA.jl/pull/567, but the initial goal is a simple but correct implementation that we can use for unlikely code paths such as error reporting.

Demo:

julia> using CUDA

julia> function test(x)
         println("This is a hostcall from thread $x")
         x+1
       end
test (generic function with 1 method)

julia> function kernel()
         rv = hostcall(test, Int, Tuple{Int}, threadIdx().x)
         @cuprintln("Hostcall returned $rv")
         return
       end
kernel (generic function with 1 method)

julia> @cuda threads=2 kernel();
This is a hostcall from thread 1
Hostcall returned 2
This is a hostcall from thread 2
Hostcall returned 3
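
The handshake behind this is roughly the following. Below is a host-only sketch of the idea (the names and structure are mine, not this PR's actual implementation): the caller publishes its arguments and raises a flag, a watcher task polls the flag, runs the requested function, writes the result back, and signals completion. In the real PR the caller side runs on the GPU and spins on host-visible memory.

const STATE = Threads.Atomic{Int}(0)   # 0 = idle, 1 = request pending, 2 = reply ready
const ARG   = Ref{Any}(nothing)
const RET   = Ref{Any}(nothing)

# watcher: a host task that services incoming requests
watcher = Threads.@spawn while true
    if STATE[] == 1
        RET[] = ARG[] + 1      # run the requested function (here: x + 1, like `test` above)
        STATE[] = 2            # publish the reply
    end
    yield()
end

# caller: stands in for the GPU side, which would spin on mapped memory instead
function fake_hostcall(x)
    ARG[] = x
    STATE[] = 1                # raise the request flag
    while STATE[] != 2
        yield()                # wait for the watcher to reply
    end
    result = RET[]
    STATE[] = 0                # back to idle
    return result
end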

Depends on https://github.com/JuliaGPU/CUDA.jl/pull/1110.

Probably requires Base support like https://github.com/JuliaLang/julia/pull/42302

cc @jpsamaroo

maleadt · Sep 09 '21 12:09

Codecov Report

Merging #1140 (35026b9) into master (5b74388) will increase coverage by 8.97%. The diff coverage is 86.07%.

⚠️ Current head 35026b9 differs from pull request most recent head 1fe2b4c. Consider uploading reports for commit 1fe2b4c to get more accurate results.

@@            Coverage Diff             @@
##           master    #1140      +/-   ##
==========================================
+ Coverage   66.97%   75.94%   +8.97%     
==========================================
  Files         118      119       +1     
  Lines        7955     7737     -218     
==========================================
+ Hits         5328     5876     +548     
+ Misses       2627     1861     -766     
Impacted Files                  Coverage   Δ
lib/cudadrv/types.jl             83.33%   <0.00%>     (-16.67%)  ↓
src/CUDA.jl                     100.00%   <ø>         (ø)
src/compiler/hostcall.jl         85.48%   <85.48%>    (ø)
src/compiler/execution.jl        84.61%   <85.71%>    (+0.54%)   ↑
lib/cudadrv/execution.jl        100.00%   <100.00%>   (+3.44%)   ↑
src/compiler/exceptions.jl       64.28%   <100.00%>   (-29.84%)  ↓
src/compiler/gpucompiler.jl      82.14%   <100.00%>   (-1.73%)   ↓
examples/wmma/low-level.jl        0.00%   <0.00%>     (-100.00%) ↓
examples/wmma/high-level.jl       0.00%   <0.00%>     (-100.00%) ↓
src/linalg.jl                    36.36%   <0.00%>     (-50.01%)  ↓
... and 72 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5b74388...1fe2b4c.

codecov[bot] · Sep 09 '21 13:09

Hmm, one problem is that the following deadlocks:

# hostcall watcher task/thread
Threads.@spawn begin
    while true
        println(1)
        sleep(1)
    end
end

# the application, possibly getting stuck in a CUDA API call that needs the kernel to finish
while true
    ccall(:sleep, Cuint, (Cuint,), 1)
end

I had expected this when running with a single thread, because the main task isn't preemptible, but even with multiple threads the main task getting stuck apparently blocks the scheduler, keeping the hostcall watcher thread from making progress. That would cause a deadlock. @vchuravy any thoughts? How does AMDGPU.jl solve this?
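
One direction that might help (a sketch only, not something this PR does): Base's @threadcall runs the foreign call on a libuv worker thread and lets the calling task yield, so a loop like the one above would no longer starve the scheduler. Whether CUDA's actual blocking driver calls could be routed this way is a separate question.

# Sketch: the same blocking foreign call, but routed through @threadcall so the
# ccall executes on a libuv worker thread and the calling task stays yieldable.
while true
    @threadcall(:sleep, Cuint, (Cuint,), 1)
end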

maleadt · Sep 09 '21 13:09

And for some preliminary time measurements:

julia> kernel() = hostcall(identity, Nothing, Tuple{Nothing}, nothing)

julia> @benchmark CUDA.@sync @cuda threads=1024 blocks=10 kernel()
BenchmarkTools.Trial: 79 samples with 1 evaluation.
 Range (min … max):  23.918 ms … 103.041 ms  ┊ GC (min … max): 0.00% … 2.35%
 Time  (median):     82.768 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   64.525 ms ±  31.968 ms  ┊ GC (mean ± σ):  0.74% ± 1.92%

  █                                                             
  █▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▂▄▁▆▅▂▁▅▁▁▂▂▄ ▁
  23.9 ms         Histogram: frequency by time          101 ms <

So roughly 2.3 µs 'per' hostcall (the 23.9 ms minimum spread over 10 × 1024 calls per launch; uncontended, and nonblocking since the call doesn't return anything). That's not great, but it's a start. I also don't want to build on this before I'm sure it won't deadlock applications.

And for reference, @cuprint and malloc (two calls that could be replaced by hostcall-based alternatives) are both an order of magnitude faster, but that's somewhat expected, as neither actually needs to communicate with the CPU: printf uses a ring buffer and is happy to trample over unprocessed output, while malloc draws from a fixed-size, preallocated buffer with a bump allocator. Still, in the uncontended case (which is essentially also a ring buffer) we should be able to do much better.
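
For illustration, the bump-allocator idea mentioned above is roughly the following host-side sketch (names are mine, not CUDA's device code): a fixed, preallocated pool plus an atomically advanced cursor, with nothing ever freed.

# Host-side sketch of the bump-allocator idea: allocations just advance an
# atomic cursor into a fixed, preallocated pool.
const POOL   = Vector{UInt8}(undef, 1 << 20)        # preallocated backing store
const CURSOR = Threads.Atomic{Int}(0)               # bytes handed out so far

function bump_alloc(nbytes::Int)
    offset = Threads.atomic_add!(CURSOR, nbytes)    # claim a region atomically
    offset + nbytes <= length(POOL) || return Ptr{UInt8}(0)  # pool exhausted
    return pointer(POOL) + offset
end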

maleadt · Sep 10 '21 13:09

I had expected this when running with a single thread, because the main task isn't preemptible, but even with multiple threads the main task getting stuck apparently blocks the scheduler, keeping the hostcall watcher thread from making progress. That would cause a deadlock.

Are you sure you are blocking the scheduler or are you blocking GC? You need at least a safepoint in the loop

vchuravy · Sep 30 '21 16:09

You need at least a safepoint in the loop

In which loop? The first one does a sleep, so that's a yield point. The second one doesn't need to be a loop; it could just as well be an API call that blocks 'indefinitely'.

maleadt · Sep 30 '21 16:09

Seems to deadlock regularly on CI, so I guess this will have to wait until we have either application threads or a way to make CUDA's blocking API calls yield.
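
A sketch of the second option, assuming some nonblocking way to query kernel completion existed (kernel_finished below is hypothetical, not an existing CUDA.jl API): poll it and yield between polls instead of blocking inside the driver.

# Hypothetical sketch: wait for the kernel by polling a nonblocking query and
# yielding to the Julia scheduler in between, so the hostcall watcher task can run.
function yielding_synchronize(kernel_finished)
    while !kernel_finished()
        yield()
    end
end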

maleadt · Oct 04 '21 08:10