Add a hostcall interface
Fixes https://github.com/JuliaGPU/CUDA.jl/issues/440
Initial, simple implementation. I still need to steal ideas from AMDGPU.jl and optimizations from https://github.com/JuliaGPU/CUDA.jl/pull/567, but the initial goal is a simple yet correct implementation that we can use for unlikely code paths such as error reporting.
Demo:
```julia
julia> using CUDA

julia> function test(x)
           println("This is a hostcall from thread $x")
           x+1
       end
test (generic function with 1 method)

julia> function kernel()
           rv = hostcall(test, Int, Tuple{Int}, threadIdx().x)
           @cuprintln("Hostcall returned $rv")
           return
       end
kernel (generic function with 1 method)

julia> @cuda threads=2 kernel();
This is a hostcall from thread 1
Hostcall returned 2
This is a hostcall from thread 2
Hostcall returned 3
```
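Conceptually, a hostcall like this boils down to a request/response handshake through a shared buffer that a host-side watcher task polls. Below is a self-contained toy of that handshake, modelled with two CPU threads instead of GPU and CPU; all names here are hypothetical, this is not the PR's actual code.

```julia
# Toy model of the hostcall handshake (hypothetical sketch, not the PR's code).
# Run with `julia -t 2`: with a single thread the busy-wait below deadlocks,
# which is exactly the scheduling problem discussed further down.
struct ToyHostcall
    state::Threads.Atomic{Int}   # 0 = idle, 1 = request posted, 2 = response ready
    arg::Base.RefValue{Int}
    ret::Base.RefValue{Int}
end
ToyHostcall() = ToyHostcall(Threads.Atomic{Int}(0), Ref(0), Ref(0))

# Host side: poll for requests and execute them on the CPU.
function watcher(hc::ToyHostcall, f)
    while true
        if hc.state[] == 1
            hc.ret[] = f(hc.arg[])   # run the requested call on the host
            hc.state[] = 2           # publish the response
        end
        yield()                      # stay cooperative with other tasks
    end
end

# "Device" side: post a request and spin until the response arrives.
function devicecall(hc::ToyHostcall, x)
    hc.arg[] = x
    hc.state[] = 1                   # post the request
    while hc.state[] != 2 end        # a kernel thread can only busy-wait
    ret = hc.ret[]
    hc.state[] = 0                   # release the slot
    return ret
end

hc = ToyHostcall()
Threads.@spawn watcher(hc, x -> x + 1)
@show devicecall(hc, 41)             # devicecall(hc, 41) = 42
```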
Depends on https://github.com/JuliaGPU/CUDA.jl/pull/1110.
Probably requires Base support like https://github.com/JuliaLang/julia/pull/42302
cc @jpsamaroo
Codecov Report
Merging #1140 (35026b9) into master (5b74388) will increase coverage by 8.97%. The diff coverage is 86.07%.

:exclamation: Current head 35026b9 differs from pull request most recent head 1fe2b4c. Consider uploading reports for the commit 1fe2b4c to get more accurate results.
```diff
@@            Coverage Diff             @@
##           master    #1140      +/-   ##
==========================================
+ Coverage   66.97%   75.94%    +8.97%
==========================================
  Files         118      119        +1
  Lines        7955     7737      -218
==========================================
+ Hits         5328     5876      +548
+ Misses       2627     1861      -766
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| lib/cudadrv/types.jl | 83.33% <0.00%> (-16.67%) | :arrow_down: |
| src/CUDA.jl | 100.00% <ø> (ø) | |
| src/compiler/hostcall.jl | 85.48% <85.48%> (ø) | |
| src/compiler/execution.jl | 84.61% <85.71%> (+0.54%) | :arrow_up: |
| lib/cudadrv/execution.jl | 100.00% <100.00%> (+3.44%) | :arrow_up: |
| src/compiler/exceptions.jl | 64.28% <100.00%> (-29.84%) | :arrow_down: |
| src/compiler/gpucompiler.jl | 82.14% <100.00%> (-1.73%) | :arrow_down: |
| examples/wmma/low-level.jl | 0.00% <0.00%> (-100.00%) | :arrow_down: |
| examples/wmma/high-level.jl | 0.00% <0.00%> (-100.00%) | :arrow_down: |
| src/linalg.jl | 36.36% <0.00%> (-50.01%) | :arrow_down: |
| ... and 72 more | | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Hmm, one problem is that the following deadlocks:
```julia
# hostcall watcher task/thread
Threads.@spawn begin
    while true
        println(1)
        sleep(1)
    end
end

# the application, possibly getting stuck in a CUDA API call that needs the kernel to finish
while true
    ccall(:sleep, Cuint, (Cuint,), 1)
end
```
I had expected this when running with a single thread, because the main task isn't preemptible, but even with multiple threads the main task getting stuck apparently blocks the scheduler, keeping the hostcall watcher thread from making progress. That would cause a deadlock. @vchuravy, any thoughts? How does AMDGPU.jl solve this?
And some preliminary time measurements:
```julia
julia> kernel() = hostcall(identity, Nothing, Tuple{Nothing}, nothing)

julia> @benchmark CUDA.@sync @cuda threads=1024 blocks=10 kernel()
BenchmarkTools.Trial: 79 samples with 1 evaluation.
 Range (min … max):  23.918 ms … 103.041 ms  ┊ GC (min … max): 0.00% … 2.35%
 Time  (median):     82.768 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   64.525 ms ± 31.968 ms   ┊ GC (mean ± σ):  0.74% ± 1.92%

  █▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▂▄▁▆▅▂▁▅▁▁▂▂▄ ▁
  23.9 ms         Histogram: frequency by time          101 ms <
```
So about 2.25 µs 'per' hostcall (uncontended, and nonblocking since the call doesn't return anything). That's not great, but it's a start. I also don't want to build on this before I'm sure it won't deadlock applications.
And for reference, `@cuprint` and `malloc` (two calls that could be replaced by hostcall-based alternatives) are both an order of magnitude faster. That's somewhat expected, as neither actually needs to communicate with the CPU: `printf` uses a ring buffer and is happy to trample over unprocessed output, while `malloc` uses a fixed-size, preallocated buffer as the source for a bump allocator, as sketched below. Still, in the uncontended case (which essentially behaves like a ring buffer as well) we should be able to do much better.
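To make the `malloc` comparison concrete, here's a minimal sketch of the bump-allocator idea in plain Julia, for illustration only (the real device-side allocator operates on GPU memory; `POOL` and `bump_alloc` are made-up names):

```julia
# Minimal bump allocator over a fixed-size, preallocated buffer.
# No per-allocation bookkeeping and no free(): just an atomic offset bump,
# which is why it needs no round trip to the CPU.
const POOL   = Vector{UInt8}(undef, 1 << 20)   # preallocated backing buffer
const OFFSET = Threads.Atomic{Int}(0)          # next free byte in POOL

function bump_alloc(sz::Int)
    off = Threads.atomic_add!(OFFSET, sz)      # claim `sz` bytes; returns old offset
    off + sz <= length(POOL) || return Ptr{UInt8}(0)  # pool exhausted
    return pointer(POOL) + off
end
```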
> I had expected this when running with a single thread, because the main task isn't preemptible, but even with multiple threads the main task getting stuck apparently blocks the scheduler, keeping the hostcall watcher thread from making progress. That would cause a deadlock.
Are you sure you are blocking the scheduler, or are you blocking the GC? You need at least a safepoint in the loop.
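For reference, a minimal sketch of what such a safepoint could look like in the second loop (this only helps if it's the GC, rather than the task scheduler, that is being blocked):

```julia
while true
    GC.safepoint()                      # let a GC requested by another thread proceed
    ccall(:sleep, Cuint, (Cuint,), 1)   # still blocks this OS thread for a second
end
```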
> You need at least a safepoint in the loop
In which loop? The first does a `sleep`, so that's a yield point. The second doesn't need to be a loop; it could just as well be an API call that blocks 'indefinitely'.
Seems to deadlock regularly on CI, so I guess this will have to wait until we either have application threads or a way to make CUDA's blocking API calls yield.
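For illustration, Base's `@threadcall` is one existing way to make a blocking `ccall` yield: it runs the call on a libuv threadpool thread while the calling task suspends. A sketch against the repro above, with the caveat that `@threadcall` has real overhead and the called C code must not call back into Julia:

```julia
# The blocking application loop from the repro, but yielding to the scheduler:
# the C call runs on a libuv worker thread while this Julia task is suspended,
# so the hostcall watcher can keep making progress.
while true
    @threadcall(:sleep, Cuint, (Cuint,), 1)
end
```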