glommio
protect the executor against reentrant tasks
What does this PR do?
Reentrant tasks are problematic in Glommio today. Consider the following code snippet:
```rust
glommio::spawn_local_into(async move {
    loop {
        glommio::executor().yield_now().await;
    }
}, some_tq)
.unwrap()
.detach(); // detach so the task keeps running in the background
// do other useful stuff
```
The problem with this pattern is that the Glommio executor runs task queues in order of virtual runtime. Because this particular task yields immediately, the vruntime of its queue increases very slowly. The result is that the task above runs repeatedly until it has spun enough that its parent queue's vruntime is no longer the smallest of all the queues. Until then, other queues will not run, even after preemption.
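To make the starvation mechanism concrete, here is a toy model (purely illustrative, not Glommio's actual scheduler code) in which queues are picked by smallest vruntime and vruntime only grows by the time a task actually consumed; the cost numbers are made up:

```rust
// Toy model of the starvation described above -- NOT glommio's scheduler.
// A queue whose task yields immediately accrues almost no vruntime, so it
// keeps winning the "smallest vruntime" comparison and crowds out the rest.
fn main() {
    // (name, vruntime in ns, cost of polling one task from this queue in ns)
    let mut queues = [
        ("yield_now spinner", 0u64, 1u64),
        ("real work", 0u64, 100_000u64),
    ];
    let mut polls = [0u64; 2];

    for _ in 0..1_000 {
        // Pick the queue with the smallest vruntime, then charge it for the run.
        let i = if queues[0].1 <= queues[1].1 { 0 } else { 1 };
        queues[i].1 += queues[i].2;
        polls[i] += 1;
    }

    // With these made-up costs the spinner is polled 999 times while the
    // "real work" queue is polled exactly once.
    for ((name, _, _), n) in queues.iter().zip(polls.iter()) {
        println!("{name} polled {n} times");
    }
}
```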
The solution to this problem is to detect "reentrant" tasks (i.e., self-scheduled tasks that were not preempted) and segregate them. The executor still visits all task queues each time it ticks, but once a task is marked as reentrant, the executor ignores it until the next tick. We perform the same trick for task queues: once a queue contains only reentrant tasks, it is no longer polled until the next time the executor ticks.
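As an illustration of the scheme (hypothetical code, not the actual diff in this PR): a task that reschedules itself without having been preempted is parked until the next tick instead of being polled again within the same tick.

```rust
use std::collections::VecDeque;

// Hypothetical model of the scheme described above -- not glommio's real code.
#[derive(Clone, Copy, Debug)]
struct Task {
    id: usize,
    // Stands in for "rescheduled itself without having been preempted".
    reentrant: bool,
}

// One executor tick: poll everything that was runnable at the start of the
// tick, but park reentrant tasks instead of re-queueing them immediately.
fn run_one_tick(runnable: &mut VecDeque<Task>) {
    let mut parked = Vec::new();
    while let Some(task) = runnable.pop_front() {
        println!("polling task {}", task.id);
        if task.reentrant {
            // The reentrant task would like to run again right away, but it is
            // ignored until the next tick so it cannot monopolize this one.
            parked.push(task);
        }
        // Non-reentrant tasks either complete or get woken later; in this toy
        // model they simply finish.
    }
    runnable.extend(parked);
}

fn main() {
    let mut runnable = VecDeque::from(vec![
        Task { id: 0, reentrant: true },  // the yield_now() spinner
        Task { id: 1, reentrant: false }, // an ordinary task
    ]);
    // The spinner is polled exactly once per tick, as described above.
    for tick in 0..3 {
        println!("-- tick {tick} --");
        run_one_tick(&mut runnable);
    }
}
```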
After this change, the above task runs exactly once per executor tick, freeing CPU time for more demanding tasks.
It is very easy to shoot yourself in the foot with this issue because yielding is what triggers it; i.e., being cooperative and yielding to the runtime actually worsens performance.
This PR significantly improves performance for code that uses the problematic pattern and does not change performance otherwise. Here is the full benchmark suite:
Before:
cost to spawn an executor : 2.170627ms
cost to spawn and join an executor : 2.294355ms
cost to run task no task queue: 51ns
cost to run task in task queue: 284ns
Running unittests (target/release/deps/local_channel-f7609026e7ceb9eb)
cost of sending contended local channel: 147ns
cost of sending uncontended local channel: 20ns
Running unittests (target/release/deps/noise-5281c7c86e02a7a4)
time to complete for 100 tasks: 35.844748ms (358.447µs per task)
time to complete for 1000 tasks: 351.352711ms (351.352µs per task)
time to complete for 10000 tasks: 3.474532626s (347.453µs per task)
time to complete for 100 tasks: 999.914299372ms (453.529µs per task)
time to complete for 1000 tasks: TOO LONG
time to complete for 10000 tasks: TOO LONG
Running unittests (target/release/deps/preempt-fef87726ebbdeb0e)
cost of checking for need_preempt: 2ns
Running unittests (target/release/deps/semaphore-1865fe3e12c61604)
cost of acquiring uncontended semaphore: 15ns
cost of acquiring contended semaphore: 180ns
Running unittests (target/release/deps/sharding-29c635067c84641c)
elapsed: 13.454542939s, average cost: 33ns
Running unittests (target/release/deps/shared_channel-b612077ba23941cb)
cost of sending rust standard channel 133ns, capacity: 1024
cost of receiving rust standard channel: 133ns, capacity 1024
cost of receiving rust standard channel: 2.358µs, capacity 1
cost of sending rust standard channel 2.358µs, capacity: 1
cost of sending shared channel 94ns, capacity: 1024
cost of receiving shared channel: 94ns, capacity 1024
cost of sending shared channel 7.948µs, capacity: 1
cost of receiving shared channel: 7.948µs, capacity 1
cost of sending (tokio) shared channel 78ns, capacity: 1024
cost of receiving (tokio) shared channel: 78ns, capacity 1024
cost of sending (tokio) shared channel 8.183µs, capacity: 1
cost of receiving (tokio) shared channel: 8.183µs, capacity 1
cost of mesh message shared channel: 26.856µs, peers 8, capacity 1024
cost of mesh message shared channel: 27.024µs, peers 8, capacity 1
Running unittests (target/release/deps/spsc_queue-20817fea22888530)
cost of receiving 6ns, capacity 1024
cost of sending 6ns, capacity 1024
Running unittests (target/release/deps/tcp-8eb2a65cfa25987f)
cost to establish a connection (serial connect) 48.444µs
cost to establish a connection (parallel connect) 20.1µs
cost of a byte-sized round trip 13.758µs
Bandwidth of sending through a single socket: 57.68 Gbps/s
Bandwidth of sending through 100 sockets: 65.21 Gbps/s
Running unittests (target/release/deps/tokio_tcp-71e63d7a5bea27d5)
cost to establish a connection 48.137µs
cost of a byte-sized round trip 28.413µs
Bandwidth of sending through a single socket: 24.37 Gbps/s
Bandwidth of sending through 100 sockets: 105.59 Gbps/s
Running unittests (target/release/deps/udp-57ccc15271b17a49)
cost of a byte-sized round trip over connected socket 12.557µs
cost of a byte-sized round trip over sendto/recvfrom 12.916µs
After:
cost to spawn an executor : 2.171802ms
cost to spawn and join an executor : 2.439593ms
cost to run task no task queue: 56ns
cost to run task in task queue: 329ns
Running unittests (target/release/deps/local_channel-f7609026e7ceb9eb)
cost of sending contended local channel: 150ns
cost of sending uncontended local channel: 20ns
Running unittests (target/release/deps/noise-5281c7c86e02a7a4)
time to complete for 100 tasks: 42.720486ms (427.204µs per task)
time to complete for 1000 tasks: 534.915492ms (534.915µs per task)
time to complete for 10000 tasks: 4.095607766s (409.56µs per task)
time to complete for 100 tasks: 45.352926ms (453.529µs per task)
time to complete for 1000 tasks: 711.594108ms (711.594µs per task)
time to complete for 10000 tasks: 4.439375159s (443.937µs per task)
Running unittests (target/release/deps/preempt-fef87726ebbdeb0e)
cost of checking for need_preempt: 2ns
Running unittests (target/release/deps/semaphore-1865fe3e12c61604)
cost of acquiring uncontended semaphore: 15ns
cost of acquiring contended semaphore: 205ns
Running unittests (target/release/deps/sharding-29c635067c84641c)
elapsed: 13.207823689s, average cost: 32ns
Running unittests (target/release/deps/shared_channel-b612077ba23941cb)
cost of sending rust standard channel 133ns, capacity: 1024
cost of receiving rust standard channel: 133ns, capacity 1024
cost of receiving rust standard channel: 2.378µs, capacity 1
cost of sending rust standard channel 2.378µs, capacity: 1
cost of sending shared channel 31ns, capacity: 1024
cost of receiving shared channel: 31ns, capacity 1024
cost of sending shared channel 7.796µs, capacity: 1
cost of receiving shared channel: 7.796µs, capacity 1
cost of sending (tokio) shared channel 98ns, capacity: 1024
cost of receiving (tokio) shared channel: 98ns, capacity 1024
cost of sending (tokio) shared channel 7.996µs, capacity: 1
cost of receiving (tokio) shared channel: 7.996µs, capacity 1
cost of mesh message shared channel: 23.936µs, peers 8, capacity 1024
cost of mesh message shared channel: 26.592µs, peers 8, capacity 1
Running unittests (target/release/deps/spsc_queue-20817fea22888530)
cost of receiving 7ns, capacity 1024
cost of sending 7ns, capacity 1024
Running unittests (target/release/deps/tcp-8eb2a65cfa25987f)
cost to establish a connection (serial connect) 42.353µs
cost to establish a connection (parallel connect) 19.822µs
cost of a byte-sized round trip 14.084µs
Bandwidth of sending through a single socket: 59.67 Gbps/s
Bandwidth of sending through 100 sockets: 64.45 Gbps/s
Running unittests (target/release/deps/tokio_tcp-71e63d7a5bea27d5)
cost to establish a connection 47.065µs
cost of a byte-sized round trip 30.641µs
Bandwidth of sending through a single socket: 26.24 Gbps/s
Bandwidth of sending through 100 sockets: 103.69 Gbps/s
Running unittests (target/release/deps/udp-57ccc15271b17a49)
cost of a byte-sized round trip over connected socket 13.136µs
cost of a byte-sized round trip over sendto/recvfrom 13.605µs
I must be missing something. In terms of the results themselves, while you claim "A gets better while nothing gets worse", a lot of benchmarks are actually worse in the "After" run. The cost of a contended semaphore is higher, the cost of running a task in a task queue is higher, and even the noise benchmarks that do complete are sometimes slower.
I don't fully understand how you define a "reentrant task", nor why this concept is needed or how it relates to the problem you describe. There are places where we manually adjust the vruntime of a queue, for example when the task queue has been idle. Why can't you just adjust the vruntime on explicit yields?
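For instance (a hypothetical sketch of that alternative, not existing glommio code), an explicit yield could charge the queue some minimum vruntime so that an immediately-yielding task stops looking like the cheapest queue:

```rust
// Hypothetical sketch of "adjust the vruntime on explicit yields" -- the
// function and numbers are made up for illustration.
fn vruntime_after_explicit_yield(vruntime: u64, actual_runtime_ns: u64, min_charge_ns: u64) -> u64 {
    // An immediate yield reports ~0 actual runtime; charging at least
    // `min_charge_ns` keeps the queue from staying the "smallest vruntime"
    // queue forever.
    vruntime + actual_runtime_ns.max(min_charge_ns)
}

fn main() {
    // A task that ran for ~50ns before yielding still gets charged a full
    // (made-up) 100µs slice.
    assert_eq!(vruntime_after_explicit_yield(0, 50, 100_000), 100_000);
}
```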
Also, it may be worth checking your application. While we can obviously improve things, maybe the solution involves not using these unconditional yields at all: yield_if_needed() is a much better pattern. I understand avoiding it because there are known issues with how fast Rust's TLS is, but if that creates another problem of a different kind, then all of a sudden TLS access doesn't sound so expensive in comparison.
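For illustration, a sketch of that pattern (assuming `yield_if_needed()` is available on the proxy returned by `glommio::executor()`; `spawn_worker`, `do_real_work`, and the task queue handle are hypothetical stand-ins):

```rust
use glommio::TaskQueueHandle;

// Hypothetical sketch of the suggested pattern: yield only when the executor
// actually asks for preemption, instead of on every iteration.
// Must be called from inside a glommio executor.
fn spawn_worker(some_tq: TaskQueueHandle) {
    glommio::spawn_local_into(async move {
        loop {
            do_real_work().await;
            // Cheap check against the preemption flag; this only yields once
            // the task has used up its time slice.
            glommio::executor().yield_if_needed().await;
        }
    }, some_tq)
    .unwrap()
    .detach();
}

async fn do_real_work() {
    // Hypothetical per-iteration work.
}
```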
I would never have thought about using a pattern such as this... I'm assuming there's some simplification going on and that's why it isn't clicking for me. The idea of just yielding in a loop doesn't seem to make a lot of sense as written, but I'm assuming there's SOME real work happening somewhere. Can you describe more why you'd use this pattern, and for what reason?