julia icon indicating copy to clipboard operation
julia copied to clipboard

Race Condition: `Test.@test_logs` is not thread safe.

Open NHDaly opened this issue 1 month ago • 1 comments

EDIT: changed to a much simpler example that shows the issue.

The @test_logs macro is not thread safe, if the code under test uses multiple threads:

julia> function foo(n)
           @info "Doing foo with n=$n"
           @sync for i=1:n
               Threads.@spawn @info "Iteration $i"
           end
           42
       end
foo (generic function with 1 method)

julia> for _ in 1:1000
           @test_logs (:info,"Doing foo with n=10000") match_mode=:any foo(10_000)
       end
julia+RAI(56155,0x1f5157ac0) malloc: double free for ptr 0x128248000
julia+RAI(56155,0x1f5157ac0) malloc: *** set a breakpoint in malloc_error_break to debug

[56155] signal (6): Abort trap: 6
in expression starting at REPL[8]:1

signal (6) thread (1) __pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 54370881 (Pool: 54363911; Big: 6970); GC: 119
[1]    56154 abort      julia -tauto

This is definitely a bug, and should be fixed.

We see this showing up in our CI build farm from time to time.

The solution is almost certainly to make that vector thread-safe, by introducing a lock around it, and then copying the vector out before returning.

NHDaly avatar May 10 '24 20:05 NHDaly

EDIT: I've edited the original message to the much more compelling example @test_logs example. Moving the previous description here, as a secondary example:

Any Task spawned during a Test, which logs after Test.collect_test_logs() returns can trigger a data race, where the spawned task logs into the logs vector, while the caller is reading from that vector.

Here is a contrived example that I managed to get to reproduce the issue:

julia> function code_to_test()
           # do the work
           @info "starting"
           start_transaction()
           # ...
           @info "done"
           # Clean up
           end_transaction()
       end
code_to_test (generic function with 1 method)

julia> start_transaction() = nothing
start_transaction (generic function with 1 method)

julia> function end_transaction()
           # spawn a task to do some cleanup in the background, so we can
           # return to the client immediately
           Threads.@spawn begin
               # clean stuff up
               @info "Background Cleanup Task: All done cleaning up!"
               for _ in 1:1000
                   @info "now it's clean, goodbye"
               end
           end
       end
end_transaction (generic function with 1 method)

julia> using Test

julia> for _ in 1:10000
           let (logs,_) = Test.collect_test_logs() do
                   @info "hey"
                   code_to_test()
                   @warn "hi"
               end

               for log in logs
                   @test length(log.message) > 1
               end
           end
       end

ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] getindex
   @ ./essentials.jl:13 [inlined]
 [2] iterate(A::Vector{LogRecord}, i::Int64)
   @ Base ./array.jl:945
 [3] top-level scope
   @ REPL[5]:10

This also sometimes segfaults, of course:

[66213] signal (11.2): Segmentation fault: 11
in expression starting at REPL[5]:1
signal (11) thread (1) sig_match_fast at /Users/nathandaly/builds/julia-RAI/src/gf.c:2914 [inlined]
signal (11) thread (1) jl_lookup_generic_ at /Users/nathandaly/builds/julia-RAI/src/gf.c:2994 [inlined]
signal (11) thread (1) ijl_apply_generic at /Users/nathandaly/builds/julia-RAI/src/gf.c:3073

We see this showing up in our CI build farm from time to time.

The solution is almost certainly to make that vector thread-safe, by introducing a lock around it, and then copying the vector out before returning.

NHDaly avatar May 16 '24 16:05 NHDaly