
Deadlock doing I/O on a foreign thread while the main thread is blocked

Open maleadt opened this issue 1 year ago • 5 comments

Reduced from https://github.com/JuliaGPU/CUDA.jl/issues/2449 filed by @miniskar:

  • create a pthread on which we call back into Julia and run a command
  • have the main thread block on pthread_join
  • the foreign thread cannot make progress because of some lock being held
#include <julia.h>
#include <pthread.h>
#include <stdio.h>

typedef void (*julia_callback)();

void *thread_function(void* callback) {
    printf("Calling Julia from thread\n");
    ((julia_callback)callback)();
    return NULL;
}
void call_on_thread(julia_callback callback) {
    printf("Creating thread\n");
    pthread_t thread;
    pthread_create(&thread, NULL, thread_function, callback);
    pthread_join(thread, NULL);
}

// alternative version that doesn't use a foreign thread,
// and as a result doesn't deadlock
void call_directly(julia_callback callback) {
    printf("Calling Julia directly\n");
    callback();
}
function callback()::Cvoid
    println("Running a command")
    run(`echo 42`)
    return
end

function main()
    callback_ptr = @cfunction(callback, Cvoid, ())
    gc_state = @ccall(jl_gc_safe_enter()::Int8)
    ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    @ccall(jl_gc_safe_leave(gc_state::Int8)::Cvoid)
    println("Done")
end

main()
❯ gcc -fPIC -shared -o wip.so wip.c -isystem $JULIA/include/julia -isystem /opt/cuda/include -L$JULIA/lib -ljulia -lpthread && \
  julia --project wip.jl
Creating thread
Calling Julia from thread
Running a command^C
[41055] signal 2: Interrupt
in expression starting at /home/tim/Julia/pkg/CUDA/wip.jl:16
unknown function (ip: 0x7b4dd01ada17)
pthread_cond_wait at /usr/lib/libc.so.6 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
ijl_task_get_next at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-11/src/scheduler.c:584
poptask at ./task.jl:1012
wait at ./task.jl:1021
uv_write at ./stream.jl:1072
unsafe_write at ./stream.jl:1145
write at ./strings/io.jl:248 [inlined]
print at ./strings/io.jl:250 [inlined]
print at ./strings/io.jl:46
jl_apply at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-11/src/julia.h:2156 [inlined]
do_apply at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-11/src/builtins.c:831
println at ./strings/io.jl:75
println at ./coreio.jl:4
callback at /home/tim/Julia/pkg/CUDA/wip.jl:3

I wasn't even sure if this is guaranteed to work, but @vchuravy mentioned that marking the blocking ccall @gc_safe ought to be enough to not have Julia hold any locks when entering C, so filing this as an issue here.

cc @vtjnash

maleadt avatar Aug 19 '24 07:08 maleadt

The deadlock is because of IO. The foreign thread is waiting for IO to happen while the thread that can run IO is blocked.

gbaraldi avatar Aug 19 '24 15:08 gbaraldi

Ah, right, so https://github.com/JuliaLang/julia/pull/50880 would fix this?

maleadt avatar Aug 19 '24 15:08 maleadt

That's what I'm experimenting with.

gbaraldi avatar Aug 19 '24 15:08 gbaraldi

I thought we already had an issue specifically about this, but I don't currently see the specific way of handling pthread_join mentioned in https://github.com/JuliaLang/julia/issues/47201, only in the parent issue that spawned it.

vtjnash avatar Aug 19 '24 15:08 vtjnash

I don't think we have a way to detect this. Nobody is holding the IO lock; the issue is that nobody is running the IO. The only way to not deadlock here is to have another thread run the IO. But once we have deadlocked here, it's too late.

gbaraldi avatar Aug 19 '24 15:08 gbaraldi
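
The situation described in the comment above can be illustrated without Julia. What follows is a minimal sketch in plain C, not the way Julia or libuv implement their event loop: `io_loop`, `io_loop_run`, `io_write_blocking` and `run_with_dedicated_io_thread` are hypothetical names made up for this illustration. A worker thread submits a request and blocks until some thread running the loop services it. Because the loop runs on its own thread, the creating thread can sit in pthread_join safely. If the only thread able to run `io_loop_run` were the one blocked in pthread_join, the worker would wait forever, which is the shape of the hang in the backtrace above.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical single-slot "I/O loop"; a stand-in for a real event loop. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    bool pending;   /* a request is waiting to be serviced */
    bool done;      /* the last request has been serviced */
    bool shutdown;  /* the loop should exit */
    int  serviced;  /* number of requests completed so far */
} io_loop;

static void io_loop_init(io_loop *l) {
    pthread_mutex_init(&l->lock, NULL);
    pthread_cond_init(&l->changed, NULL);
    l->pending = l->done = l->shutdown = false;
    l->serviced = 0;
}

/* Services requests until shutdown. Some thread has to be running this,
 * otherwise blocking writers never wake up. */
static void *io_loop_run(void *arg) {
    io_loop *l = arg;
    pthread_mutex_lock(&l->lock);
    while (!l->shutdown) {
        if (l->pending) {
            l->pending = false;
            l->done = true;
            l->serviced++;
            pthread_cond_broadcast(&l->changed);
        } else {
            pthread_cond_wait(&l->changed, &l->lock);
        }
    }
    pthread_mutex_unlock(&l->lock);
    return NULL;
}

/* Submit one request and block until the loop has serviced it. */
static void io_write_blocking(io_loop *l) {
    pthread_mutex_lock(&l->lock);
    l->done = false;
    l->pending = true;
    pthread_cond_broadcast(&l->changed);
    while (!l->done)
        pthread_cond_wait(&l->changed, &l->lock);
    pthread_mutex_unlock(&l->lock);
}

/* Plays the role of the foreign thread that performs I/O. */
static void *worker(void *arg) {
    io_write_blocking(arg);
    return NULL;
}

/* Runs the loop on a dedicated thread, so joining the worker is safe.
 * Returns the number of requests the loop serviced. */
int run_with_dedicated_io_thread(void) {
    io_loop l;
    pthread_t io_thread, w;
    io_loop_init(&l);
    pthread_create(&io_thread, NULL, io_loop_run, &l);
    pthread_create(&w, NULL, worker, &l);
    pthread_join(w, NULL); /* the loop keeps running on io_thread */

    pthread_mutex_lock(&l.lock);
    l.shutdown = true;
    pthread_cond_broadcast(&l.changed);
    pthread_mutex_unlock(&l.lock);
    pthread_join(io_thread, NULL);

    int n = l.serviced;
    pthread_cond_destroy(&l.changed);
    pthread_mutex_destroy(&l.lock);
    return n;
}
```

Called from a `main` and compiled with `-pthread`, `run_with_dedicated_io_thread()` returns 1. Removing the `io_thread` and having the joining thread be the only candidate to run `io_loop_run` reproduces the hang pattern: no lock is held, yet nothing ever services the request.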

Is this issue fixed in some branch?

miniskar avatar Sep 09 '24 14:09 miniskar

Thank you Nash for the fix. I have verified this experimental fix with the MWE code and also with my application code, which uses CUDA GPUs through pthreads. It is working well.

May I know the plan for including this fix in an upcoming release?

miniskar avatar Dec 05 '24 17:12 miniskar