tokio meta: Runtime metrics stabilization

Tracks the stabilization of runtime statistics.

RFC: #3845 PRs: #4043

Roadmap

[x] Release current implementation as unstable to crates.io (#4083)
[x] Polish docs (remove TODO)
[x] Add more counters
[x] Write tokio-metrics, providing a higher-level API that is easier to consume and understand.
[ ] Resolve open questions.
[x] Validate design by evaluating users' experience reports.
- [x] @Matthias247: graph

Open questions

Naming: Should the type be named Metrics, Stats, or PerfCounters?
Should RuntimeStats::workers() return &[WorkerStats] or an iterator?
Should there be a feature flag to enable stats explicitly?
- Should there be a runtime Builder option to enable / disable stats?
Builder API for more complex configurations, like histograms (#5685).
inc_budget_forced_yield_count should become a per-worker metric.
Should some current counters be lowered to internal counters?
- steal_count
- steal_operations
- overflow_count
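
For the `workers()` question above, the trade-off can be sketched as follows. This is a minimal illustration, not tokio's actual API: the struct layout, the `new` constructor, and the method names `workers_slice` and `workers_iter` are hypothetical. Returning a slice commits the implementation to contiguous storage, while returning an opaque iterator leaves the internal layout free to change.

```rust
/// Hypothetical per-worker stats; the field is for illustration only.
pub struct WorkerStats {
    pub poll_count: u64,
}

/// Hypothetical runtime-level stats holding one entry per worker.
pub struct RuntimeStats {
    workers: Vec<WorkerStats>,
}

impl RuntimeStats {
    pub fn new(poll_counts: &[u64]) -> Self {
        let workers = poll_counts
            .iter()
            .map(|&poll_count| WorkerStats { poll_count })
            .collect();
        RuntimeStats { workers }
    }

    /// Option A: a slice. Callers get indexing and `len()`, but the
    /// stats must stay in one contiguous allocation forever.
    pub fn workers_slice(&self) -> &[WorkerStats] {
        &self.workers
    }

    /// Option B: an opaque iterator. The storage can later change
    /// without breaking callers, at the cost of no random access.
    pub fn workers_iter(&self) -> impl Iterator<Item = &WorkerStats> + '_ {
        self.workers.iter()
    }
}
```

Both options support the common case of summing a counter across workers; they differ mainly in how much of the internal layout becomes part of the stable API.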

Additional counters

Duration between last two polls.
- There are open questions related to how this should be used.
Worker queue depth
- It is unclear how this should be tracked.

Aug 26 '21 16:08 carllerche

@LucioFranco @Matthias247 any opportunities to try using the metrics?

Sep 21 '21 18:09 carllerche

I was on vacation and after that mostly blocked on other things. But I might be able to try this out this week.

I will however say upfront that I expect mostly to report back on the general accessor APIs and what integration into an application will look like. I think that putting poll_count/steal_count/etc. on a dashboard will not be super useful for most people, because the numbers in themselves have no significant meaning. They don't necessarily indicate that something is right or wrong. The not-yet-implemented timing metrics are more interesting, because they would indicate issues with code in tasks blocking too long. I will nevertheless check and see what the other metrics would look like.

Sep 28 '21 01:09 Matthias247

We should add a counter tracking the number of "false-positive" runtime wakeups. This would be incremented when a worker wakes up without having any work to do.
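
A minimal sketch of what such a counter might look like, assuming a simplified worker with a local run queue. The `Worker` and `WorkerStats` types, their fields, and the `on_wake` method are hypothetical and do not reflect tokio's scheduler internals.

```rust
use std::collections::VecDeque;

/// Hypothetical per-worker counters.
#[derive(Default)]
pub struct WorkerStats {
    /// Total number of times the worker was woken up.
    pub wakeups: u64,
    /// Wakeups that found no work to do ("false positives").
    pub noop_wakeups: u64,
    /// Number of tasks processed.
    pub polls: u64,
}

/// Hypothetical worker; tasks are represented by plain integers.
#[derive(Default)]
pub struct Worker {
    queue: VecDeque<u32>,
    pub stats: WorkerStats,
}

impl Worker {
    pub fn new() -> Self {
        Worker::default()
    }

    pub fn push(&mut self, task: u32) {
        self.queue.push_back(task);
    }

    /// Called when the worker is unparked.
    pub fn on_wake(&mut self) {
        self.stats.wakeups += 1;
        if self.queue.is_empty() {
            // Woken up, but there was nothing to run.
            self.stats.noop_wakeups += 1;
            return;
        }
        // Drain the queue, counting each processed task.
        while let Some(_task) = self.queue.pop_front() {
            self.stats.polls += 1;
        }
    }
}
```

The ratio of `noop_wakeups` to `wakeups` would then give a rough idea of how often workers are woken unnecessarily.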

Dec 23 '21 22:12 carllerche

> The not-yet-implemented timing metrics are more interesting, because they would indicate issues with code in tasks blocking too long.

I second this. In our project built on tokio, we implemented a custom macro to track poll times. It would be very useful to have it in tokio.
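
As a rough illustration of tracking poll times, here is a sketch that uses a future wrapper rather than a macro; it is not the macro mentioned above, and the names `TimedPoll` and `poll_once` are hypothetical. It relies only on the standard library and records the duration of every `poll` call on the wrapped future.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Hypothetical wrapper that times each `poll` of the inner future.
pub struct TimedPoll<F> {
    inner: Pin<Box<F>>,
    pub poll_count: u64,
    pub total: Duration,
    pub max: Duration,
}

impl<F: Future> TimedPoll<F> {
    pub fn new(inner: F) -> Self {
        TimedPoll {
            inner: Box::pin(inner),
            poll_count: 0,
            total: Duration::ZERO,
            max: Duration::ZERO,
        }
    }
}

impl<F: Future> Future for TimedPoll<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Measure only the time spent inside the inner future's poll.
        let start = Instant::now();
        let result = self.inner.as_mut().poll(cx);
        let elapsed = start.elapsed();

        self.poll_count += 1;
        self.total += elapsed;
        self.max = self.max.max(elapsed);
        result
    }
}

/// Waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `fut` exactly once outside of any runtime.
pub fn poll_once<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    Pin::new(fut).poll(&mut cx)
}
```

A large `max` value points at a task that blocks the worker for too long in a single poll. Calling `Instant::now()` twice per poll has a cost of its own, which is one reason such timing might be made opt-in.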

Dec 31 '21 10:12 e-ivkov

It has been quite a long time. What's the status of this issue, @carllerche?

It is quite useful for library authors to know the number of workers in the current Runtime.

May 05 '22 06:05 zonyitoo

Work has stalled a bit on this, but we would be happy to mentor people on this work.

May 05 '22 15:05 LucioFranco

Hello @LucioFranco, is this feature ready to be published?

Jun 29 '22 07:06 zonyitoo

I don't believe this is ready for stabilization.

Jun 29 '22 16:06 LucioFranco

What's the status on this? It looks like nothing has happened for a while here.

@LucioFranco, I would be willing to pick up where you left off here; I just need to know what needs to be done.

Aug 29 '22 17:08 Noah-Kennedy

I believe we have a bunch of metrics in the runtime now and tokio-metrics has an initial version published. @carllerche would know the rest. Go ahead and pick it up, I won't have time for a bit. Happy to be tagged for reviews tho.

Aug 29 '22 17:08 LucioFranco

I aim to stabilize this feature by EOY.

As I look at the current counters, I think some of the more "scheduler implementation detail" counters should be switched to be internal counters. For example, the following counters are more useful for debugging the scheduler than debugging application performance issues:

steal_count
steal_operations
overflow_count

May 24 '23 18:05 carllerche

At the risk of reopening bikeshedding, we must also decide on a name: metrics, stats, or counters.

Given that we have more than just counters, that probably rules out counters as a name.

Let's see what ChatGPT suggests :laughing: :

In the context of Tokio's scheduler or any similar system, the terms "metrics" and "statistics" can overlap significantly. However, based on the definitions provided, the examples you've listed—how many tasks were scheduled, how many times the worker went to sleep, a histogram of how long it took to poll each task—could be considered both metrics and statistics. Here's why:

As Statistics: These measures describe certain aspects of your data or system operation. They give you quantitative summaries of the system performance and behaviors, which is in line with the definition of statistics.

As Metrics: These numbers can also be viewed as metrics, especially if you're using them to track and assess the performance of your system, identify bottlenecks, or measure against a particular performance target or benchmark.

So, these terms can be interchangeable in many cases. The term you choose to use can depend on your specific context or preference, or how you intend to use the data. If you're using these numbers primarily to understand system behavior or describe what's happening, "statistics" might be the more appropriate term. If you're using them to evaluate system performance or track against goals, "metrics" might be more suitable.

In general, in software systems and libraries like Tokio, we often talk about "performance metrics" or just "metrics" as it implies ongoing tracking and often is used for making decisions about system improvements or changes.

May 24 '23 18:05 carllerche

Artificial Indecisiveness

May 25 '23 21:05 Noah-Kennedy

Hi! Sorry, is it too late to ask to turn all gauges into counter pairs? Metrics like active_tasks_count or injection_queue_depth are fast-moving gauges, and even taking a snapshot every few seconds doesn't say much about what's going on inside Tokio. It would be better to use two counters: one for additions and one for removals. When snapshotting, one can then calculate the rate at which items went into a queue, or how many tasks were spawned during the snapshot interval. The current queue length or number of active tasks is the delta between the two counters, if needed. That makes it much more usable for monitoring.
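
A minimal sketch of the counter-pair idea, assuming one monotonically increasing counter for additions and one for removals. The `CounterPair` and `Snapshot` types and their methods are hypothetical names, not part of tokio.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical replacement for a gauge such as a queue depth:
/// two counters that only ever increase.
#[derive(Default)]
pub struct CounterPair {
    added: AtomicU64,
    removed: AtomicU64,
}

/// Point-in-time copy of both counters.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Snapshot {
    pub added: u64,
    pub removed: u64,
}

impl CounterPair {
    pub fn record_add(&self) {
        self.added.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_remove(&self) {
        self.removed.fetch_add(1, Ordering::Relaxed);
    }

    /// The two loads are not atomic as a pair, so under concurrent
    /// updates the derived depth is only approximate.
    pub fn snapshot(&self) -> Snapshot {
        Snapshot {
            added: self.added.load(Ordering::Relaxed),
            removed: self.removed.load(Ordering::Relaxed),
        }
    }
}

impl Snapshot {
    /// Equivalent of the old gauge: items currently in the queue.
    pub fn depth(&self) -> u64 {
        self.added.saturating_sub(self.removed)
    }

    /// Items added between `prev` and this snapshot.
    pub fn added_since(&self, prev: &Snapshot) -> u64 {
        self.added.wrapping_sub(prev.added)
    }

    /// Items removed between `prev` and this snapshot.
    pub fn removed_since(&self, prev: &Snapshot) -> u64 {
        self.removed.wrapping_sub(prev.removed)
    }
}
```

Dividing `added_since` or `removed_since` by the snapshot interval gives a throughput rate, which a sampled gauge alone cannot provide.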

Jul 20 '23 09:07 cloneable
