tokio icon indicating copy to clipboard operation
tokio copied to clipboard

RFC: Sharding the timer wheel

Open ADD-SP opened this issue 11 months ago • 1 comments

Fundamental difficulty

Sharing the deadline across all worker thread.

  • The timer wheel is for calculating the deadline efficiently.
  • The global wheel is for single source of truth, otherwise we have to sync the wheel state across all worker threads.

This RFC doesn't solve it, it just like making the best of a hard job.

The #7343 is an attempt to avoid solving this fundamental difficulty because it doesn't have it, but it's still not perfect.

Summary

  • This proposal involves splitting the global timer wheel into several local wheels for each worker thread.
  • This proposal does not break any existing public stable APIs.

Motivation

Please check out #6504 for initial context.

How does timer wheel work?

Image

A timer wheel is consist by N level of ring arrays.

Each slot in the lowest array represents one millisecond. As each millisecond passes, the indexes are turned and the timers stored in the relevant slot are fired.

After each full revolution, the higher ring array is turned and the pointed timers are fired.

Current implementation

Image

There is only one global timer wheel for each runtime. It is protected by a Mutex, so acquiring the Mutex is required for creating, firing or cancelling timers.

Problems

As mentioned in #6504, creating a timer for the socket and cancelling it once the socket is readable/writable is the most common way to enforce timeouts for socket operations.

Apparently, work threads frequently race the Mutex, even when only processing a socket they own.

Guide-level explanation

Image

We will build a local timer wheel for each worker thread, this wheel still requires Mutex to allow timer-stealing.

  • Local wheel can efficiently create, fire and cancel timers without contentions (except timer stealing).
  • Local wheel exposes an AtomicU64 that records the elapsed milliseconds.
  • On each call to park(), this worker thread will select a random worker and check how far behind the remote wheel it is. It will steal expired timers if it is too far behind.

Details

Feature flag

All codes will be gated by the tokio_unstable_lockless_timer until the implementation is stable.

Mutex<Wheel> but expose the wheel.elapsed using AtomicU64.

The exposed wheel.elapsed allows worker thread to inspect remote thread without acquiring the Mutex.

  • It is rare to steal timers from remote.
  • Registering/canceling timers just locks the wheel for only short periods, which is much cheaper than OS-level context switching.

Worker unparking

A runtime-level AtomicU64 will be updated on each registering of timer, which allows parker to get the park timeout efficiently.

Multi-core scalability

This solution doesn't has a well scalability in multi-core machine, but still better than global Mutex.

Drawbacks

  • It is hard to get a reasonable threshold about how far after the remote wheel is behind to start stealing the timer, small value increases contentions, large value might delay the timers too much.
  • Overall is a huge change, and obviously risky.

Alternatives

Mutex<Wheel> for each worker thread

This is what the https://github.com/tokio-rs/tokio/commit/1914e1e4b9bfe6ea2d61970ec3fcf2b5d7bb0210 did, and has been reverted by https://github.com/tokio-rs/tokio/commit/1ae9434e8e4a419ce25644e6c8d2b2e2e8c34750 due to increased overhead by locking all Mutex<Wheel> on the hot path while maintaining the Core.

Pure local Wheel without timer-stealing

This reduces the risk and coding works well, but if the worker thread is blocked, the local timers might lag too much. This would definitely make the timer system worse than the current implementation.

Prior art

Covered by Alternatives section.

Unresolved questions

  • What is the reasonable threshold to start stealing timers? How many timers for each stealing?
  • Does the model work if the future is stolen to a remote thread but the associated timer is on the local wheel?

ADD-SP avatar May 17 '25 11:05 ADD-SP

History:

  • Replace the spin lock by Mutex because Mutex will spin for a while before sleeping.
  • Replaced the lock-less wheel with spin locked wheel due to the complexity of lock-less doubly linked list.

ADD-SP avatar May 18 '25 10:05 ADD-SP

Superseded by #7384

ADD-SP avatar Sep 16 '25 14:09 ADD-SP