Thread starvation with channels and short CPU-bound tasks
The following code repeatedly computes fibonacci(35): each worker task reads from a channel of repeated 35s, computes the result, and writes the answer to a second channel, which the main task reads from.
The results below are from a Julia I built locally from git, but I also checked 1.8, 1.7, and 1.6, and they all seem to produce the same results.
versioninfo()
using Base.Threads

# Channel buffer sizes for (input, output); the combinations I tried:
bufsize = (0, 0)
# bufsize = (1, 1)
# bufsize = (Inf, 1)
# bufsize = (Inf, 2)
# bufsize = (Inf, nthreads())

M = 100   # number of work items pushed into the input channel
N = 35

function fib(n::Int)
    n ≤ 1 && return n
    # yield()
    return fib(n-1) + fib(n-2)
end
fib(4); @time fib(N)   # warm up, then time a single fib(N)

# Input channel: a feeder task pushes M copies of N.
chan1 = Channel(bufsize[1]) do ch
    for i = 1:M
        put!(ch, N)
    end
end
# Output channel for the answers.
chan2 = Channel(bufsize[2])

# Each worker takes items from chan1, computes fib, and pushes the result to chan2.
# t1 times the computation; t2 times the put! (which, IIUC, includes any time the task spends descheduled).
function work(inchan::Channel, outchan::Channel)
    for item in inchan
        t1 = @elapsed(ans = fib(item))
        t2 = @elapsed(put!(outchan, ans))
        write(stdout, "$t1\t\t$t2\n")
    end
end

workers = [Threads.@spawn work(chan1, chan2) for _ = 1:nthreads()]
task = @async try
    foreach(wait, workers)
finally
    close(chan2)
end

start_time = time()
let ans = 0.0
    @time for out in chan2
        ans += out
        # write(stdout, "Received $out at time $(time() - start_time)\n")
    end
    @show ans
end
wait(task)
Output with one thread is as expected (this run used M = 10):
Julia Version 1.9.0-DEV.1060
Commit 9630911913* (2022-07-27 22:14 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 16 × AMD Ryzen 7 3700X 8-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.5 (ORCJIT, znver2)
Threads: 1 on 16 virtual cores
0.039245 seconds
0.038489566 1.091e-5
0.038821857 2.39e-6
0.039129028 3.56e-6
0.038957067 1.97e-6
0.039022858 3.2e-6
0.038853247 2.17e-6
0.038885257 2.03e-6
0.038841087 1.9e-6
0.038924737 2.05e-6
0.039081728 1.326e-5
0.462716 seconds (156.59 k allocations: 10.619 MiB, 0.95% gc time, 15.86% compilation time)
ans = 9.227465e7
Output with 16 threads (16 logical cores) contains times like this:
0.08043542 0.03993961
0.044416951 4.87e-6
0.0560119 0.012806472
0.058470526 6.999e-5
0.058279525 5.7721e-5
0.054369256 0.00416349
0.060140631 0.008733791
0.058281356 0.00025411
0.053919504 0.004409561
0.058286455 0.000943573
0.058295365 0.000935493
0.058397286 0.000869162
0.058391216 0.002046295
0.058476206 0.001996315
0.059021337 0.012297011
0.059112117 0.012210901
0.053715444 3.141e-6
0.055667919 0.010728936
0.054322325 2.53e-6
0.058348385 0.00806197
As you can see, the put! call (IIUC, timed from when put! is run until the task is scheduled again) sometimes takes 40 ms!
To make it perhaps clearer, here is the same thing with 64 threads. Obviously you expect some contention, but 0.84 seconds is too much! You can also see that evaluating the Fibonacci numbers themselves takes about ten times as long as in the single-threaded run.
0.323439597 7.83e-6
0.211954179 0.434703116
0.21205967 0.432967271
0.338488766 6.18e-6
0.361778323 6.39e-6
0.178212246 6.34e-6
0.192800211 5.17e-6
0.158102705 5.9e-6
0.211804779 0.555197386
0.223284377 0.542639945
0.223705859 0.573294652
0.222776026 0.543168616
0.210183145 0.587802608
0.218625276 0.63237654
0.331875599 6.8e-6
0.227680118 0.740324659
0.115289758 6.38e-6
0.229289233 0.782709554
0.231637728 0.801299991
0.218063285 0.807791687
0.23240926 0.817447871
0.216793761 0.842381844
0.15221773 0.841836692
From perf report I can see that the missing time is being spent in the scheduler:
Overhead Command Shared Object Symbol
63.39% julia libjulia-internal.so.1.9 [.] jl_rand_ptls
10.92% julia sys.so [.] julia_multiq_deletemin_48823
8.19% julia libjulia-internal.so.1.9 [.] _jl_mutex_wait
4.53% julia [JIT] tid 326399 [.] 0x00007fa1a4088116
1.76% julia [JIT] tid 326399 [.] 0x00007fa1a40880e0
1.58% julia [JIT] tid 326399 [.] 0x00007fa1a40880e4
1.41% julia [JIT] tid 326399 [.] 0x00007fa1a40880e2
1.24% julia [JIT] tid 326399 [.] 0x00007fa1a40880e5
1.24% julia [JIT] tid 326399 [.] 0x00007fa1a4088118
The issue goes away entirely if you set bufsize = (Inf, Inf) so that both channels are unbounded. The issue remains if you set small channel buffer sizes like (2, 2).
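For concreteness, the variant that avoids the problem just makes both channels effectively unbounded (a minimal sketch reusing the definitions above):

```julia
# Channel(Inf) requests the maximum buffer size, so put! on either channel
# never has to block waiting for room in the buffer.
chan1 = Channel(Inf) do ch
    for i = 1:M
        put!(ch, N)
    end
end
chan2 = Channel(Inf)
```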
I built with DTRACE=1, and checking runtime probes with bpftrace

sudo bpftrace -v -p $(pidof -s julia) ~/software/julia/contrib/bpftrace/rt_all.bt | tee fib_trace.out

gives some suspicious-looking sequences, but I am out of my depth to interpret these.
...
Task pausing: 499caef0
Task running: 48518010
Task processing libuv events: 48518010
Task processed libuv events: 48518010
Task pausing: 48518010
Task running: 499caef0
Task processing libuv events: 499caef0
Task processed libuv events: 499caef0
Task pausing: 499caef0
Task running: 492c9780
Task processing libuv events: 492c9780
Task processed libuv events: 492c9780
Task pausing: 492c9780
Task running: 492caa40
Thread waking: 4000b60 (was sleeping?: 0)
Thread waking: 14000b60 (was sleeping?: 0)
Thread waking: 18000b60 (was sleeping?: 0)
Thread waking: ec000b60 (was sleeping?: 0)
Thread waking: 1c000b60 (was sleeping?: 0)
Thread waking: f4000b60 (was sleeping?: 0)
Thread waking: c000b60 (was sleeping?: 0)
Thread waking: 34000b60 (was sleeping?: 0)
Thread waking: 8000b60 (was sleeping?: 0)
Thread waking: 2c000b60 (was sleeping?: 0)
Thread waking: 28000b60 (was sleeping?: 0)
Thread waking: fc000b60 (was sleeping?: 0)
Thread waking: f0000b60 (was sleeping?: 0)
Thread waking: 24000b60 (was sleeping?: 0)
Thread waking: 20000b60 (was sleeping?: 0)
Task pausing: 492c9780
Task running: 499caef0
Task processing libuv events: 499caef0
Thread 34000b60 inserting task to multiq: 1
Thread 2c000b60 inserting task to multiq: 1
Thread 24000b60 inserting task to multiq: 1
Thread 28000b60 inserting task to multiq: 1
Thread 1c000b60 inserting task to multiq: 1
Thread 20000b60 inserting task to multiq: 1
Thread 14000b60 inserting task to multiq: 1
Thread 18000b60 inserting task to multiq: 1
Thread c000b60 inserting task to multiq: 1
Thread 4000b60 inserting task to multiq: 1
Thread 8000b60 inserting task to multiq: 1
Thread fc000b60 inserting task to multiq: 1
Thread f4000b60 inserting task to multiq: 1
Thread ec000b60 inserting task to multiq: 1
Thread waking: ec000b60 (was sleeping?: 0)
Thread wakeup: ec000b60
Thread waking: 28000b60 (was sleeping?: 0)
Thread waking: c000b60 (was sleeping?: 0)
Thread waking: 24000b60 (was sleeping?: 0)
Thread waking: fc000b60 (was sleeping?: 0)
Thread waking: 2c000b60 (was sleeping?: 0)
Thread waking: f4000b60 (was sleeping?: 0)
Thread waking: 14000b60 (was sleeping?: 0)
Thread waking: 20000b60 (was sleeping?: 0)
Thread waking: 8000b60 (was sleeping?: 0)
Thread f0000b60 inserting task to multiq: 1
Task processed libuv events: 499caef0
Thread 2c000b60 inserting task to multiq: 1
Thread 24000b60 inserting task to multiq: 1
Thread 28000b60 inserting task to multiq: 1
Thread 20000b60 inserting task to multiq: 1
Thread 14000b60 inserting task to multiq: 1
Thread c000b60 inserting task to multiq: 1
Thread 8000b60 inserting task to multiq: 1
Thread fc000b60 inserting task to multiq: 1
Thread f4000b60 inserting task to multiq: 1
Task pausing: 499caef0
Task running: 48518010
...
I should add that this is a reduced test case from an issue I found in my own code, where a for loop run with @threads would use my CPU at 100%, but switching to a worker queue with channels results in utilization of only 30-60%. This reduced case doesn't reproduce that part: htop shows the CPU pegged at 100% on all cores in the Fibonacci example.
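For context, the two patterns look roughly like this (a simplified sketch of my original code; items and process are placeholders for the real data and per-item work):

```julia
using Base.Threads

items = fill(35, 100)   # placeholder workload
process(n) = fib(n)     # stand-in for the real per-item work

# Pattern 1: static partitioning with @threads -- this pegs all cores at ~100% for me.
@threads for item in items
    process(item)
end

# Pattern 2: worker queue over a Channel -- this only reaches 30-60% utilization for me.
jobs = Channel{Int}(Inf)
foreach(i -> put!(jobs, i), items)
close(jobs)
@sync for _ in 1:nthreads()
    Threads.@spawn for item in jobs
        process(item)
    end
end
```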
I looked into it a little bit more with bpftrace, lldb, and perf. It seems that there is a loop in multiq_deletemin in partr.jl that assumes the task it picks from the task minheap is unassigned, and tries to call jl_set_task_tid on it. Sometimes that task already has a different thread's tid in it (is this supposed to ever happen?), in which case it starts an infinite loop where jl_set_task_tid fails, and it picks the same task from the minheap again.
I checked, and enq_work definitely puts tasks with task->tid != -1 into partr (I put in an assert, and it crashes when you try to print something, or when some task waits and is woken up on a condition variable). I found it confusing whether this is a bug or not, because the deletemin code does look like it assumes that never happens.
https://github.com/JuliaLang/julia/blob/de375ef5167914a5398dd16726e95dfb5e08c392/base/partr.jl#L165 https://github.com/JuliaLang/julia/blob/de375ef5167914a5398dd16726e95dfb5e08c392/base/task.jl#L763
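To make the failure mode concrete, here is a toy, self-contained sketch of that pattern; it is not the real partr.jl code (which is at the links above), and ToyTask / try_set_tid! are made-up stand-ins for jl_task_t and jl_set_task_tid:

```julia
mutable struct ToyTask
    @atomic tid::Int      # -1 means "not yet assigned to any thread"
end

# Succeeds only if the task has no thread assigned yet (or already has ours).
function try_set_tid!(t::ToyTask, tid::Int)
    old, success = @atomicreplace t.tid -1 => tid
    return success || old == tid
end

# Sketch of the deletemin retry behavior: if the candidate at the top of the
# heap already carries another thread's tid (because it was enqueued while it
# was still running), the loop retries and can pick the very same task again.
function deletemin_sketch(heap::Vector{ToyTask})
    while true
        isempty(heap) && return nothing
        task = heap[1]
        try_set_tid!(task, Threads.threadid()) || continue
        return popfirst!(heap)
    end
end
```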
cc: @kpamnany
I have no idea how I missed this notification, very sorry.
So, if a task waits on a Condition, that involves two steps: 1) adding itself to the condition's list of waiters and 2) yielding. If the Condition is notified in between these two steps, the task could be scheduled, i.e. added to the multi-queue, before it yields. I think there are other situations in which a task gets enqueued while it is still being run. Due to this, multiq_deletemin checks the task's tid and won't run the task if the tid is not -1.
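In rough, simplified form the sequence looks like this; this is only a sketch of the two steps, not Base's actual wait implementation:

```julia
# Simplified sketch of wait(c::Threads.Condition); Base's real code differs in details.
function wait_sketch(c::Threads.Condition)
    ct = current_task()
    push!(c.waitq, ct)   # step 1: register the current task as a waiter
    unlock(c)
    # <-- the race window: a concurrent notify(c) can call schedule(ct) right
    #     here, pushing a task that is *still running* onto the multi-queue.
    #     This is what the tid != -1 check in multiq_deletemin defends against.
    try
        return wait()    # step 2: yield into the scheduler
    finally
        lock(c)
    end
end
```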
Just ran a test and found that this happens a lot. There are buckets of CPU time being wasted here. I'm digging deeper into this.
Typically, condition variables are implemented such that the two steps I mention above (add to queue and yield) happen atomically, to prevent lost wakeups. We dodge this problem with our wakeup mechanism (which is just a call to schedule), but the cost is the defensive check in multiq_deletemin which, when I tested, happens very frequently and hurts the throughput of all the threads.
There's currently no good solution to this problem because Julia's scheduler runs in the context of the task being switched from; in this case, the waiting task. If there is no task to switch to, the thread polls the queues and runs the should-I-sleep logic, all in the context of the waiting task. To make the two steps atomic, we'd have to either:
- push the unlock deep into the scheduler, to where it actually switches away from the waiting task, or
- switch away from the waiting task early and run the scheduler in its own context (as in the original partr implementation).
The first option is bad because holding the condition lock for a long time can be terrible for performance, and it also opens the door to a deadlock -- there's no available task to run because it is blocked trying to notify the condition. The second option has to do two task switches -- to the scheduler and then away from it -- but at ~150 ns per task switch it is still the better solution by far. Unfortunately, that will have to wait for a scheduler rewrite.
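(For scale: two extra switches at ~150 ns each is roughly 300 ns of overhead per wait, versus the multi-millisecond put! stalls shown above.)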
If anyone has a better solution to this problem, I'm eager to hear it.
I'll open a PR tomorrow that will (hopefully) reduce the frequency of the defensive check triggering by delaying the queue unlock somewhat. This will hopefully be a net gain -- threads will spin around the retry loop less often, but the wait call on a condition will take longer (as will any concurrent wait or notify calls).
Cc: @vchuravy, @pchintalapudi, @vtjnash, @JeffBezanson
Awesome writeup. Thank you!
For 1., how deep into the scheduler would it be? How many layers of functions are there between yield() and actually doing the task switch? If we had a newly introduced function like unlock_and_yield(lock), which duplicated much of the functionality of yield and then plumbed the lock through to the actual task switch, how much plumbing would be needed?
> 2. switch away from the waiting task early and run the scheduler in its own context (as in the original partr implementation).
>
> The second option has to do two task switches -- to the scheduler and then away from it -- but at ~150 ns per task switch it is still the better solution by far.
For 2.: I'm still not very familiar with partr. I tried reading it but I didn't find this part of the code.
I'm not sure if you've already considered something like this, but one possibility is that the second task switch doesn't necessarily have to be a normal, expensive task switch: if we have a one-off kind of scheduler task whose context we never need to save, and instead always start its function from scratch, you can do a cheaper task switch that doesn't save any state back to that scheduler task.
You could have basically this one special-cased task that never saves its state like a normal task does, and never goes in any queues; it just sits there waiting for anyone to run it, and there's no need to set a tid on it or anything, since any thread can run it in parallel (or you could have one per thread).
I'm not sure how much time that saves, but it could help a bit.
> Unfortunately, that will have to wait for a scheduler rewrite.

Do you think this would be such an invasive change that a rewrite is needed? Couldn't it be bolted in?
I instrumented Julia to measure the time lost due to these discards. For the fib code in the issue, I see a bit over 1 ms lost with 8 threads -- certainly not significant. I'm testing this on a lot more code and will report what I find, but this does not seem to be "buckets of time".
I'll try bpftrace to see if there's anything unexpected.
I tested this instrumentation internally and found that the cost of this scheduler quirk is roughly 0.2% for most of our code, with a maximum of 3% for one outlier. Thus I think this issue essentially exposes overheads in our implementation of channels and in overall scheduler efficiency. There's no real fix for this (that I can see) until we improve the scheduler and our synchronization constructs across the board.