
@print functionality inconsistent with CUDADevice()

Open leios opened this issue 5 years ago • 6 comments

This is more for my own reference later.

If I create the following kernel:

@kernel function check_kernel()
    cI = @index(Global)
    if cI == 2
        @print("cI is: ", cI, '\n')
    end
end

function check(use_gpu)
    if use_gpu
        kernel! = check_kernel(CUDADevice(), 256)
    else
        kernel! = check_kernel(CPU(),8)
    end
    kernel!(ndrange=256) 
end

I find odd behavior in the REPL:

julia> include("src/check.jl")
cI is: 2KernelAbstractions
.CudaEvent(CuEvent(Ptr{Nothing} @0x000055d1cee5f4c0, CuContext(Ptr{Nothing} @0x000055d1ce966110)))

julia> wait(check(true))

julia> wait(check(false))
cI is: 2

julia> exit()
cI is: 2

For some reason, output from kernels run on a CUDADevice() only appears when exiting the REPL.

I guess there is some way to make the wait() function also flush the @print output, but it might take some digging.
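A hedged workaround sketch for the behavior above (assumes CUDA.jl is loaded and a GPU is available; whether a device-level sync actually flushes the buffer is the open question in this thread):

```julia
using CUDA, KernelAbstractions

ev = check(true)     # launch the kernel from the snippet above
wait(ev)             # wait on the returned event
CUDA.synchronize()   # device-level synchronization; per the CUDA docs this
                     # should flush the device-side printf buffer
```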

leios avatar Jul 10 '20 15:07 leios

This buffer is flushed only for

the start of a kernel launch
synchronization (e.g. cudaDeviceSynchronize())
blocking memory copies (e.g. cudaMemcpy(...))
module load/unload
context destruction

An important thing to note is that this list does not include program exit. If the call to cudaDeviceSynchronize() was removed from the example program above, then we would see no output.

http://15418.courses.cs.cmu.edu/spring2013/article/15

vchuravy avatar Jul 10 '20 15:07 vchuravy

Ok, bear with me.

I added @synchronize() to the kernel and it didn't seem to change anything. I then updated the @print macro's CUDA backend to also synchronize.

If I run the check() function n times, I get n outputs on Julia exit, which leads me to believe this is more of a Julia thing than a CUDA thing.

Could it be that the @print macro sends its output to another printing thread that @show doesn't see?

leios avatar Jul 10 '20 15:07 leios

It is definitely a CUDA thing. The only thing we do is call vprintf on the device; the output is then managed by CUDA. The synchronize mentioned on the website I found is a CUDA.synchronize(), which performs a device-level synchronization, not a workgroup-level synchronization like @synchronize. The reason you get the output on exit from Julia is probably that we intentionally destroy the context, and context destruction is one of the flush triggers listed above.

vchuravy avatar Jul 10 '20 16:07 vchuravy

Ok, but it looks like wait already calls that synchronize.

From the cuda backend:

function wait(::CPU, ev::CudaEvent, progress=yield)
    if progress === nothing
        CUDA.synchronize(ev.event)  # block until this event completes
    else
        while !isdone(ev)           # poll the event, yielding in between
            progress()
        end
    end
end

I was trying CUDA.sync_threads() in the overdubbed print macro, which I thought would do the same thing?

leios avatar Jul 10 '20 16:07 leios

I was trying CUDA.sync_threads() in the overdubbed print macro, which I thought would do the same thing?

No, that is still a within-kernel synchronize.

CUDA.synchronize(ev.event) is an event synchronize, whereas I think you need a device-level CUDA.synchronize(): https://github.com/JuliaGPU/CUDA.jl/blob/4ea9cabea7ab7da58b7bf73d946b60a604e3b445/lib/cudadrv/context.jl#L151
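To illustrate the distinction (a hedged sketch, assuming CUDA.jl and a GPU; the comments summarize this thread's discussion, not verified behavior):

```julia
using CUDA

# Three different "synchronize" operations, at three different levels:
#
# 1. CUDA.sync_threads() / KernelAbstractions' @synchronize:
#    a barrier *inside* a kernel, across threads of one block/workgroup.
#    Does not flush the device-side printf buffer.
#
# 2. CUDA.synchronize(ev):
#    waits for one recorded CuEvent to complete.
#
# 3. CUDA.synchronize():
#    blocks until all outstanding work on the device is done and, per the
#    CUDA documentation, flushes the device-side printf buffer as a side
#    effect.
CUDA.synchronize()
```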

vchuravy avatar Jul 10 '20 16:07 vchuravy

Ah, tried that one too, but got a spectacular error message. I'll poke around more after* an upcoming meeting

  • after = during.

leios avatar Jul 10 '20 16:07 leios