ThreadsX.jl
QuickSort hangs
On @chriselrod's machine, running the quicksort benchmark hangs with 8 threads. When it hangs, the CPU usage is 0%. Originally reported here: https://discourse.julialang.org/t/ann-threadsx-jl-parallelized-base-functions/36666/14
At the moment, it seems that the MWE is while true; @btime ThreadsX.sort($(rand(0:0.01:1, 1_000_000))); end.
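For copy-pasting, here is the same loop as a self-contained snippet (nothing added beyond the imports):

using BenchmarkTools, ThreadsX

# Benchmark the parallel sort over and over until it deadlocks (stop with ctrl-C).
# rand(0:0.01:1, 1_000_000) draws from only 101 distinct values, so the input
# has many duplicate elements.
while true
    @btime ThreadsX.sort($(rand(0:0.01:1, 1_000_000)))
end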
Stacktrace:
[ Info: Running: `scaling_nthreads_target.jl` with 8 thread(s)
[ Info: Benchmark: sum_sin
[ Info: Parameter: (datasize = 1000000,)
[ Info: Parameter: (datasize = 10000000,)
[ Info: Benchmark: foreach_symmetrize
[ Info: Parameter: (datasize = 2000,)
[ Info: Parameter: (datasize = 6000,)
[ Info: Benchmark: sort
[ Info: Parameter: (datasize = 1000000, distribution = "wide", alg = "ThreadsX.MergeSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 10000000, distribution = "wide", alg = "ThreadsX.MergeSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 1000000, distribution = "narrow", alg = "ThreadsX.MergeSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 10000000, distribution = "narrow", alg = "ThreadsX.MergeSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 1000000, distribution = "wide", alg = "ThreadsX.QuickSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 10000000, distribution = "wide", alg = "ThreadsX.QuickSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 1000000, distribution = "narrow", alg = "ThreadsX.QuickSort", basesize = nothing, smallsize = nothing)
^CERROR: ERROR: InterruptException:
Stacktrace:LoadError:
[1] poptaskref(::Base.InvasiveLinkedListSynchronized{Task}) at ./task.jl:702
[2] wait at ./task.jl:709 [inlined]
[3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
[4] wait(::Base.Process) at ./process.jl:622
[5] success at ./process.jl:483 [inlined]
[6] run(::Cmd; wait::Bool) at ./process.jl:440
[7] run at ./process.jl:438 [inlined]
[8] runscript(::String, ::Array{String,1}; env::Array{Pair{String,String},1}) at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/ThreadsXBenchmarks.jl:52
[9] run_nthreads(::String; nthreads_range::UnitRange{Int64}) at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/ThreadsXBenchmarks.jl:82
[10] #run_all#10 at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/ThreadsXBenchmarks.jl:112 [inlined]
[11] run_all(::String) at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/ThreadsXBenchmarks.jl:112
[12] top-level scope at REPL[3]:1
[13] eval(::Module, ::Any) at ./boot.jl:331
[14] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chriselrod/Documents/languages/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:130
[15] run_backend(::REPL.REPLBackend) at /home/chriselrod/.julia/packages/Revise/2K7IK/src/Revise.jl:1070
[16] top-level scope at none:0
julia> InterruptException:
Stacktrace:
[1] poptaskref(::Base.InvasiveLinkedListSynchronized{Task}) at ./task.jl:702
[2] wait at ./task.jl:709 [inlined]
[3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
[4] _wait(::Task) at ./task.jl:238
[5] sync_end(::Array{Any,1}) at ./task.jl:294
[6] macro expansion at ./task.jl:335 [inlined]
[7] maptasks(::ThreadsX.Implementations.var"#97#98"{Float64,Base.Order.ForwardOrdering}, ::Base.Iterators.Zip{Tuple{Base.Iterators.PartitionIterator{SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}},Base.Iterators.PartitionIterator{Array{Int8,1}}}}) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
[8] _quicksort!(::Array{Float64,1}, ::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Int64,Int64}, ::Base.Order.ForwardOrdering, ::Array{Int8,1}, ::Bool, ::Bool) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
[9] sort!(::Array{Float64,1}, ::Int64, ::Int64, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Nothing,Int64}, ::Base.Order.ForwardOrdering) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:22
[10] _sort! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:130 [inlined]
[11] #sort!#86 at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:170 [inlined]
[12] #3 at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:21 [inlined]
[13] ##core#345(::NamedTuple{(:datasize, :dist, :alg, :basesize, :smallsize),Tuple{Int64,NamedTuple{(:label, :value),Tuple{String,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}},NamedTuple{(:label, :value),Tuple{String,ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Nothing,Int64}}},Nothing,Nothing}}, ::MersenneTwister, ::Array{Float64,1}, ::var"#3#14") at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
[14] ##sample#346(::BenchmarkTools.Parameters) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:379
[15] sample at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:394 [inlined]
[16] _lineartrial(::BenchmarkTools.Benchmark{Symbol("##benchmark#344")}, ::BenchmarkTools.Parameters; maxevals::Int64, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:133
[17] _lineartrial(::BenchmarkTools.Benchmark{Symbol("##benchmark#344")}, ::BenchmarkTools.Parameters) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:125
[18] #invokelatest#1 at ./essentials.jl:710 [inlined]
[19] invokelatest at ./essentials.jl:709 [inlined]
[20] #lineartrial#38 at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:33 [inlined]
[21] lineartrial at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:33 [inlined]
[22] tune!(::BenchmarkTools.Benchmark{Symbol("##benchmark#344")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:209
[23] macro expansion at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:116 [inlined]
[24] macro expansion at /home/chriselrod/.julia/packages/ProgressLogging/g8xnW/src/ProgressLogging.jl:470 [inlined]
[25] macro expansion at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:104 [inlined]
[26] macro expansion at /home/chriselrod/.julia/packages/ProgressLogging/g8xnW/src/ProgressLogging.jl:470 [inlined]
[27] main(::Array{String,1}; benchmark_definitions::Dict{Symbol,NamedTuple{(:prepare, :run, :paramasdata, :paramaxes),T} where T<:Tuple}, scriptname::String, tags::Array{String,1}) at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:100
[28] main at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:96 [inlined] (repeats 2 times)
[29] top-level scope at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:144
[30] include(::String) at ./client.jl:442
[31] top-level scope at none:10
in expression starting at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:143
@chriselrod I'm wondering if the hang was data-dependent. If that's the case, can you get the hang more quickly by
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
@btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(seed[] += 1), 0:0.01:1, 1_000_000))
and then using the same seed
@btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(seed[]), 0:0.01:1, 1_000_000))
?
Also, it would be great if you could try the debug branch from #103 of ThreadsX and run one of the MWEs that cause the hang. I'm wondering if one of the tasks threw something.
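If it helps, a branch can be installed with Pkg; something like this (the rev value is a placeholder, not the actual branch name from #103):

using Pkg
# "BRANCH" is a placeholder; substitute the debug branch name from PR #103.
Pkg.add(PackageSpec(url = "https://github.com/tkf/ThreadsX.jl", rev = "BRANCH"))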
julia> while true; @btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)); end
...
seed[] += 1 = 34162
seed[] += 1 = 34163
seed[] += 1 = 34164
seed[] += 1 = 34165
seed[] += 1 = 34166
seed[] += 1 = 34167
seed[] += 1 = 34168
seed[] += 1 = 34169
seed[] += 1 = 34170
seed[] += 1 = 34171
seed[] += 1 = 34172
seed[] += 1 = 34173
seed[] += 1 = 34174
seed[] += 1 = 34175
seed[] += 1 = 34176
seed[] += 1 = 34177
seed[] += 1 = 34178
seed[] += 1 = 34179
seed[] += 1 = 34180
seed[] += 1 = 34181
seed[] += 1 = 34182
seed[] += 1 = 34183
seed[] += 1 = 34184
seed[] += 1 = 34185
seed[] += 1 = 34186
seed[] += 1 = 34187
seed[] += 1 = 34188
seed[] += 1 = 34189
seed[] += 1 = 34190
seed[] += 1 = 34191
seed[] += 1 = 34192
seed[] += 1 = 34193
seed[] += 1 = 34194
seed[] += 1 = 34195
seed[] += 1 = 34196
seed[] += 1 = 34197
seed[] += 1 = 34198
seed[] += 1 = 34199
^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_mutex_unlock at /home/chriselrod/Documents/languages/julia/src/locks.h:143 [inlined]
jl_task_get_next at /home/chriselrod/Documents/languages/julia/src/partr.c:441
poptaskref at ./task.jl:702
wait at ./task.jl:709 [inlined]
task_done_hook at ./task.jl:444
jl_apply at /home/chriselrod/Documents/languages/julia/src/julia.h:1685 [inlined]
jl_finish_task at /home/chriselrod/Documents/languages/julia/src/task.c:198
start_task at /home/chriselrod/Documents/languages/julia/src/task.c:697
unknown function (ip: (nil))
It finally hung when seed[] += 1 returned 34199.
Starting a new Julia session after the crash (it crashes when I ctrl-C)...
julia> using ThreadsX, BenchmarkTools, Random
julia> seed = Ref(34199)
Base.RefValue{Int64}(34199)
julia> @btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(seed[]), 0:0.01:1, 1_000_000));
2.749 ms (21066 allocations: 10.11 MiB)
Thanks for trying it out! So it's more like a timing thing...?
I asked people in the #multithreading Slack channel if they have some ideas about this.
Have you tried to reproduce with
using BenchmarkTools, ThreadsX, Random
seed = Ref(0);
while true; @btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)); end
?
I tried it a bit and then switched to while true; @btime ThreadsX.sort($(rand(0:0.01:1, 1_000_000))); end after seeing your last comment, as I thought it was more about the scheduler. So far, no hang for me.
Do you mind sharing lscpu output, by the way?
it crashes when I ctrl-C
I couldn't get the hang, but I do get this part: ctrl-C during the while loop killed the Julia REPL.
No hang? Interesting. For how long have you been running it?
Normally it doesn't take too long for it to hang for me. The last three were after 10, 8, and 34 successful @btimes.
I can provide the full lscpu if you need it, but for now I truncated the details on vulnerabilities and flags:
> lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 36
On-line CPU(s) list: 0-35
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Stepping: 7
CPU MHz: 4080.897
CPU max MHz: 4600.0000
CPU min MHz: 1200.0000
BogoMIPS: 6000.00
Virtualization: VT-x
L1d cache: 576 KiB
L1i cache: 576 KiB
L2 cache: 18 MiB
L3 cache: 24.8 MiB
NUMA node0 CPU(s): 0-35
It doesn't always crash. I just got:
^CERROR: InterruptException:
Stacktrace:
[1] poptaskref(::Base.InvasiveLinkedListSynchronized{Task}) at ./task.jl:702
[2] wait at ./task.jl:709 [inlined]
[3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
[4] _wait(::Task) at ./task.jl:238
[5] sync_end(::Array{Any,1}) at ./task.jl:294
[6] macro expansion at ./task.jl:335 [inlined]
[7] _quicksort!(::Array{Float64,1}, ::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Int64,Int64}, ::Base.Order.ForwardOrdering, ::Array{Int8,1}, ::Bool, ::Bool) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:97
[8] sort!(::Array{Float64,1}, ::Int64, ::Int64, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Nothing,Int64}, ::Base.Order.ForwardOrdering) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:22
[9] _sort! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:130 [inlined]
[10] #sort!#86 at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:170 [inlined]
[11] sort! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:156 [inlined]
[12] ##core#354(::Array{Float64,1}) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
[13] ##sample#355(::BenchmarkTools.Parameters) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:377
[14] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#353")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:411
[15] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#353")}, ::BenchmarkTools.Parameters) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:399
[16] #invokelatest#1 at ./essentials.jl:710 [inlined]
[17] invokelatest at ./essentials.jl:709 [inlined]
[18] #run_result#37 at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined]
[19] run_result at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined] (repeats 2 times)
[20] top-level scope at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:483
[21] top-level scope at REPL[16]:1
[22] eval(::Module, ::Any) at ./boot.jl:331
[23] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chriselrod/Documents/languages/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:130
[24] run_backend(::REPL.REPLBackend) at /home/chriselrod/.julia/packages/Revise/2K7IK/src/Revise.jl:1070
[25] top-level scope at none:0
This is an issue for crashing on the interrupt exceptions: https://github.com/JuliaLang/julia/issues/34184
No hang? Interesting. For how long have you been running it?
Sadly, no. I ran it for ~40 minutes last time and then gave up.
I can provide the full lscpu if you need it, but for now I truncated the details on vulnerabilities and flags:
Thanks, this is enough. I was mostly wondering about the NUMA topology, as I saw a strange crash before when using multiple NUMA nodes (not sure if it's related at all, though).
Ah, I was actually using a Julia nightly I downloaded a while ago (1.5.0-DEV.464). Now that I've switched to 1.5.0-DEV.526, I get the hang! I get it with JULIA_NUM_THREADS=13 but not with JULIA_NUM_THREADS=8 so far.
@chriselrod Thanks for the help! Sorry for the extra work that could have been avoided if I had been a bit more cautious.
Did you try JULIA_NUM_THREADS=13 with 1.5.0-DEV.464?
I wanted to bisect, so I jumped back to 1.5.0-DEV.460, and got the hang there.
You are right. I tried to bisect the bug and it hangs with 1.5.0-DEV.465 (and so presumably with 1.5.0-DEV.464) too.
While trying to bisect, I noticed that it can sometimes take more than 20 minutes to get a hang on my machine. So maybe I was too impatient before. I also switched to Docker to run this (since the host OS is too old to build Julia), so that may affect things somewhat... Anyway, I'll try to run this overnight with a larger range while no one is using the machine.
Bisection script (used with git bisect run env JULIA_NUM_THREADS=13 ./julia PATH/TO/script.jl):
run(`make -j$(Sys.CPU_THREADS)`)
code = raw"""
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
"""
const EXIT_CODE = Ref(2)
jl = run(
    pipeline(`$(Base.julia_cmd()) -e "$code"`; stdout = stdout, stderr = stderr);
    wait = false,
)
try
    pid = getpid(jl)
    start = time_ns()
    top = open(pipeline(`top -b -d 10 -p$pid`; stderr = stderr))
    try
        unknown_line = 0
        while true
            local i
            while true
                ln = readline(top)
                if match(r"^ *PID ", ln) !== nothing
                    i = findfirst(==("%CPU"), split(ln))
                    if i === nothing
                        error("Cannot parse `top` header:\n", ln)
                    end
                    break
                end
            end
            ln = readline(top)
            columns = split(ln)
            if string(pid) in columns
                unknown_line = 0
                pcpu = parse(Float64, columns[i])
                @info "%CPU = $pcpu"
                minutes = (time_ns() - start) / 1000^3 / 60
                if minutes < 1
                    @info "Ignoring %CPU for first 1 minute... $((time_ns() - start) / 1000^3) seconds passed."
                    continue
                end
                if pcpu < 10.0
                    EXIT_CODE[] = 1
                    break
                end
                if minutes > 30
                    @info "More than 30 minutes without a hang."
                    EXIT_CODE[] = 0
                    break
                end
                @info "$minutes minutes without a hang...."
            else
                if jl.exitcode >= 0
                    error("Unexpected exit")
                end
                unknown_line > 10 && error("Too many parse failures")
                unknown_line += 1
                @error "Cannot parse `top` output. Continuing..." ln
            end
        end
    finally
        kill(top)
    end
finally
    kill(jl)
end
exit(EXIT_CODE[])
That's incredible -- really neat script!
I ran while true; @btime ThreadsX.sort($(rand(0:0.01:1, 1_000_000))); end for more than half an hour without a problem. I picked a commit on the master branch from back in November (where I again ran it for over half an hour without the issue) as the good starting point.
I'm running your script now, and will let you know how it goes.
Great! This is good news! Thanks for figuring out a version where it works. I did think the bug might be in the scheduler, but if it weren't, I'd be hopelessly lost.
I modified the script to use ps instead of top, because my OS provides htop instead of top (i.e., /usr/bin/top is actually htop), which doesn't have the -b option, and its output isn't really parseable:
run(`make -j$(Sys.CPU_THREADS)`)
code = raw"""
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
"""
const EXIT_CODE = Ref(2)
jl = run(
    pipeline(`$(Base.julia_cmd()) -e "$code"`; stdout = stdout, stderr = stderr);
    wait = false,
)
try
    pid = getpid(jl)
    io = IOBuffer()
    cmd = pipeline(`ps -p$pid -h -o %cpu`, stdout = io)
    start = time_ns()
    while true
        sleep(10)
        run(cmd)
        pcpu = parse(Float64, String(take!(io)))
        @info "%CPU = $pcpu"
        minutes = (time_ns() - start) / 1000^3 / 60
        if minutes < 1
            @info "Ignoring %CPU for first 1 minute... $((time_ns() - start) / 1000^3) seconds passed."
            continue
        end
        if pcpu < 10.0
            EXIT_CODE[] = 1
            break
        end
        if minutes > 30
            @info "More than 30 minutes without a hang."
            EXIT_CODE[] = 0
            break
        end
        @info "$minutes minutes without a hang...."
    end
finally
    kill(jl)
end
exit(EXIT_CODE[])
Yeah, it's also fortunate that this bug doesn't seem to impact 1.4 or any released version of Julia.
Hmm... it's too bad that we can't rely on top really being top.
I modified the script to use ps instead of top
This may be OS-dependent, but when I tried using ps before, its %cpu output was not really useful (not as accurate as top). See: https://unix.stackexchange.com/questions/58539/top-and-ps-not-showing-the-same-cpu-result
You're right, I just ran into this problem. It hung, but the reported %CPU was declining slowly from over 700% (with 8 threads); it wouldn't have gotten below 10% before the 30 minutes were up.
I think this may work:
run(`make -j$(Sys.CPU_THREADS)`)
code = raw"""
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
"""
const EXIT_CODE = Ref(2)
jl = run(
    pipeline(`$(Base.julia_cmd()) -e "$code"`; stdout = stdout, stderr = stderr);
    wait = false,
)
try
    start = time_ns()
    pid = getpid(jl)
    io = IOBuffer()
    cmd = pipeline(`ps -p$pid -h -o cputimes`, stdout = io)
    current_cpu = last_cpu = 0.0
    last_time = current_time = 0.0
    while true
        sleep(10)
        run(cmd)
        current_cpu = parse(Float64, String(take!(io)))
        current_time = (time_ns() - start) * 1e-9
        pcpu = 100.0 * (current_cpu - last_cpu) / (current_time - last_time)
        @info "%CPU = $pcpu"
        minutes = current_time / 60
        if minutes < 1
            @info "Ignoring %CPU for first 1 minute... $(current_time) seconds passed."
            continue
        end
        if pcpu < 10.0
            EXIT_CODE[] = 1
            break
        end
        if minutes > 120
            @info "More than 120 minutes without a hang."
            EXIT_CODE[] = 0
            break
        end
        @info "$minutes minutes without a hang...."
        last_time = current_time; last_cpu = current_cpu
    end
finally
    kill(jl)
end
exit(EXIT_CODE[])
Above it's set to go for 2 hours before deciding that things work. That's probably not long enough.
I decided to try the following commit a few times:
https://github.com/JuliaLang/julia/commit/65b8e7ed72541d1dd5bc1f4dcabc5245e1dba718
I got the deadlock at seed[] += 1 = 12 and = 73, and it's currently at 196 (after 33 minutes) without locking up yet.
I think I'd need to run it for a very long time before being reasonably confident that it won't.
It may also be worth modifying the script to run multiple instances in parallel. I get the deadlock with 8 threads, and the CPU has 18 physical cores. It's probably worth running three instances of 8 threads in parallel, and declaring the commit bad if any of them locks up; a sketch of this idea follows below.
EDIT: Finally hung at 239. As a rough estimate from these three data points (assuming a geometric distribution: the mean is (12 + 73 + 239)/3 = 108, so the chance of surviving 500 runs is about (1 - 1/108)^500 ≈ 1%), anything under 500 is plausible, so we'd have to run for at least that many before finding it likely that it won't deadlock eventually.
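Here's a rough sketch of that multi-instance idea (untested; mwe.jl is a made-up file name standing in for the seeded loop above, and the ps -o cputimes polling is borrowed from the monitoring scripts in this thread):

# Cumulative CPU seconds for a pid. If a worker exits, the `ps` call fails
# and this script errors out, which is acceptable for bisection purposes.
cputime(pid) = parse(Float64, read(`ps -p$pid -h -o cputimes`, String))

# Start one 8-thread julia process running the MWE loop.
function spawn_worker(nthreads = 8)
    env = Dict(ENV)
    env["JULIA_NUM_THREADS"] = string(nthreads)
    run(pipeline(setenv(`$(Base.julia_cmd()) mwe.jl`, env);
                 stdout = stdout, stderr = stderr); wait = false)
end

function monitor(; nworkers = 3, minutes = 30)
    jls = [spawn_worker() for _ in 1:nworkers]
    prev = [cputime(getpid(jl)) for jl in jls]
    start = time_ns()
    try
        while (time_ns() - start) < minutes * 60 * 1e9
            sleep(10)
            for (k, jl) in enumerate(jls)
                cur = cputime(getpid(jl))
                # Less than one CPU-second accumulated over a 10-second window
                # means this worker is idle, i.e. (most likely) deadlocked.
                cur - prev[k] < 1.0 && return 1  # bad commit
                prev[k] = cur
            end
        end
        return 0  # no hang within the time limit
    finally
        foreach(kill, jls)
    end
end

exit(monitor())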
Could this be a problem?
signal (11): Segmentation fault
in expression starting at REPL[3]:1
- at ./int.jl:52 [inlined]
unsafe_length at ./range.jl:517 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
axes1 at ./abstractarray.jl:95 [inlined]
eachindex at ./abstractarray.jl:267 [inlined]
eachindex at ./abstractarray.jl:270 [inlined]
eachindex at ./abstractarray.jl:260 [inlined]
partition_sizes! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
unknown function (ip: 0x7f67b598804f)
unknown function (ip: (nil))
Allocations: 10113947179 (Pool: 10113915735; Big: 31444); GC: 29811
That is:
partition_sizes! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
There'd have to have been something weird with the subarrays.
EDIT: I'm running Julia with --check-bounds=yes to see if I eventually get a BoundsError.
EDIT: As an update, with --check-bounds=yes, it's been running for about two and a half hours, having completed 861 @btimes without hanging, erroring, or crashing.
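(For reference, a minimal way to launch that configuration; mwe.jl is again a made-up file name for the seeded loop:)

env = Dict(ENV)
env["JULIA_NUM_THREADS"] = "8"  # same 8-thread setup as above
run(setenv(`$(Base.julia_cmd()) --check-bounds=yes mwe.jl`, env))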
I decided to try the following commit a few times: JuliaLang/julia@65b8e7e
Yes, my bisection points to https://github.com/JuliaLang/julia/pull/32599 (which includes https://github.com/JuliaLang/julia/commit/65b8e7ed72541d1dd5bc1f4dcabc5245e1dba718), too. After this PR (https://github.com/JuliaLang/julia/commit/dc46ddd857d9fcf178ee5074f635dc29c5b1ade6) I can get the hang but not before the PR (https://github.com/JuliaLang/julia/commit/f5dbc471efc3334a8c532f56fb40b06427fa52c7).
Somehow it was hard to get a hang with https://github.com/JuliaLang/julia/commit/dc46ddd857d9fcf178ee5074f635dc29c5b1ade6; it took 18 minutes to get a deadlock one time. I missed this commit the first time since it didn't cause the deadlock within 30 minutes. I ran the script once for one hour on https://github.com/JuliaLang/julia/commit/f5dbc471efc3334a8c532f56fb40b06427fa52c7 and it didn't cause the hang.
Which Julia revision did you use to get the segmentation fault?
The last run with https://github.com/JuliaLang/julia/commit/f5dbc471efc3334a8c532f56fb40b06427fa52c7 (before JuliaLang/julia#32599) caused another segmentation fault:
seed[] += 1 = 247
[ Info: %CPU = 1167.0
[ Info: 45.550472802933335 minutes without a hang....
signal (11): Segmentation fault
in expression starting at none:3
setindex! at ./Base.jl:0 [inlined]
setindex! at ./subarray.jl:301 [inlined]
partition_sizes! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:167
#97 at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:158 [inlined]
#3 at ./threadingconstructs.jl:126
unknown function (ip: 0x7f54cc050ee6)
_jl_invoke at /julia/src/gf.c:2144 [inlined]
jl_apply_generic at /julia/src/gf.c:2328
jl_apply at /julia/src/julia.h:1695 [inlined]
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 6273831512 (Pool: 6273798676; Big: 32836); GC: 17817
But, staring at the code,
https://github.com/tkf/ThreadsX.jl/blob/2e630bff52db2e8136e3607a1d44d86811723c1e/src/quicksort.jl#L160-L172
I don't know how it could cause a segfault.
Looking at the code in your stack trace (https://github.com/tkf/ThreadsX.jl/issues/104#issuecomment-606736472), I'm puzzled as to why it would cause a segmentation fault. Is it possible that the stack trace is broken?
- at ./int.jl:52
unsafe_length at ./range.jl:517
unsafe_indices at ./abstractarray.jl:99
_indices_sub at ./subarray.jl:409
axes at ./subarray.jl:404
axes1 at ./abstractarray.jl:95
eachindex at ./abstractarray.jl:267
eachindex at ./abstractarray.jl:270
eachindex at ./abstractarray.jl:260
I'm running the MWE with --check-bounds=yes and it's seed[] += 1 = 423 now. No hang or segfault so far.
Edit: I gave up after seed[] += 1 = 1077
So far, I think we discovered that:
- ThreadsX.jl and/or julia before JuliaLang/julia#32599 have a bug that manifests as a segfault when quicksort is run for a very long time.
- PR JuliaLang/julia#32599 makes this bug manifest as a hang, which can happen more easily than before the PR.
@chriselrod Do you agree?
Which Julia revision did you use to get the segmentation fault?
That was 1.5.0-DEV.300 (https://github.com/JuliaLang/julia/commit/65b8e7ed72541d1dd5bc1f4dcabc5245e1dba718). I have no idea why it would segfault either. In my ignorance, a broken stack trace seems possible.
Edit: I gave up after
seed[] += 1 = 1077
I gave up after seed[] += 1 = 1435. So it seems like --check-bounds=yes makes the problems much less likely (perhaps zero). I'm still using 8 threads, while you were using 13.
I'll try with other numbers of threads. I don't know how brittle this is.
I think you may be right, but it's really hard to be confident with how sporadic these hangs are.
EDIT:
I just got two hangs on https://github.com/JuliaLang/julia/commit/65b8e7ed72541d1dd5bc1f4dcabc5245e1dba718
at seed[] += 1 = 460 and = 470 with 8 and 18 threads, respectively.
So I just filed an issue at https://github.com/JuliaLang/julia/issues/35341 with a summary of the things we've discovered so far.
I'm also trying to build Julia with ASAN (https://github.com/JuliaLang/julia/issues/35338#issuecomment-607550149), hoping it finds some memory-related bug. But I don't know anything about LLVM or C++, so I'm not sure how much I can find.