RFC: Sharding the timer wheel
Fundamental difficulty
Sharing the deadline across all worker thread.
- The timer wheel is for calculating the deadline efficiently.
- The global wheel is for single source of truth, otherwise we have to sync the wheel state across all worker threads.
This RFC doesn't solve it, it just like making the best of a hard job.
The #7343 is an attempt to avoid solving this fundamental difficulty because it doesn't have it, but it's still not perfect.
Summary
- This proposal involves splitting the global timer wheel into several local wheels for each worker thread.
- This proposal does not break any existing public stable APIs.
Motivation
Please check out #6504 for initial context.
How does timer wheel work?
A timer wheel is consist by N level of ring arrays.
Each slot in the lowest array represents one millisecond. As each millisecond passes, the indexes are turned and the timers stored in the relevant slot are fired.
After each full revolution, the higher ring array is turned and the pointed timers are fired.
Current implementation
There is only one global timer wheel for each runtime. It is protected by a Mutex, so acquiring the Mutex is required for creating, firing or cancelling timers.
Problems
As mentioned in #6504, creating a timer for the socket and cancelling it once the socket is readable/writable is the most common way to enforce timeouts for socket operations.
Apparently, work threads frequently race the Mutex, even when only processing a socket they own.
Guide-level explanation
We will build a local timer wheel for each worker thread, this wheel still requires Mutex to allow timer-stealing.
- Local wheel can efficiently create, fire and cancel timers without contentions (except timer stealing).
- Local wheel exposes an
AtomicU64that records the elapsed milliseconds. - On each call to
park(), this worker thread will select a random worker and check how far behind the remote wheel it is. It will steal expired timers if it is too far behind.
Details
Feature flag
All codes will be gated by the tokio_unstable_lockless_timer until the implementation is stable.
Mutex<Wheel> but expose the wheel.elapsed using AtomicU64.
The exposed wheel.elapsed allows worker thread to inspect remote thread without acquiring the Mutex.
- It is rare to steal timers from remote.
- Registering/canceling timers just locks the wheel for only short periods, which is much cheaper than OS-level context switching.
Worker unparking
A runtime-level AtomicU64 will be updated on each registering of timer, which allows parker to get the park timeout efficiently.
Multi-core scalability
This solution doesn't has a well scalability in multi-core machine, but still better than global Mutex.
Drawbacks
- It is hard to get a reasonable threshold about how far after the remote wheel is behind to start stealing the timer, small value increases contentions, large value might delay the timers too much.
- Overall is a huge change, and obviously risky.
Alternatives
Mutex<Wheel> for each worker thread
This is what the https://github.com/tokio-rs/tokio/commit/1914e1e4b9bfe6ea2d61970ec3fcf2b5d7bb0210 did, and has been reverted by https://github.com/tokio-rs/tokio/commit/1ae9434e8e4a419ce25644e6c8d2b2e2e8c34750 due to increased overhead by locking all Mutex<Wheel> on the hot path while maintaining the Core.
Pure local Wheel without timer-stealing
This reduces the risk and coding works well, but if the worker thread is blocked, the local timers might lag too much. This would definitely make the timer system worse than the current implementation.
Prior art
Covered by Alternatives section.
Unresolved questions
- What is the reasonable threshold to start stealing timers? How many timers for each stealing?
- Does the model work if the future is stolen to a remote thread but the associated timer is on the local wheel?
History:
- Replace the spin lock by
MutexbecauseMutexwill spin for a while before sleeping. -
Replaced the lock-less wheel with spin locked wheel due to the complexity of lock-less doubly linked list.
Superseded by #7384