tokio
meta: Runtime metrics stabilization
Tracks the stabilization of runtime statistics.
RFC: #3845 PRs: #4043
Roadmap
- [x] Release current implementation as unstable to crates.io (#4083)
- [x] Polish docs (remove TODO)
- [x] Add more counters
- [x] Write `tokio-metrics`, providing a higher-level API to be consumed that is easier to understand.
- [ ] Resolve open questions.
- [x] Validate design by evaluating users' experience reports.
- [x] @Matthias247: graph
Open questions
- Naming: Should the type be named `Metrics`, `Stats`, or `PerfCounters`?
- Should `RuntimeStats::workers()` return `&[WorkerStats]` or an iterator?
- Should there be a feature flag to enable stats explicitly?
- Should there be a runtime `Builder` option to enable / disable stats?
- Builder API for more complex configurations, like histograms (#5685).
- `inc_budget_forced_yield_count` should become a per-worker metric.
- Should some current counters be lowered to internal counters?
- steal_count
- steal_operations
- overflow_count
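To make the `&[WorkerStats]` vs. iterator question concrete, here is a minimal sketch of the two accessor shapes. All type and method names below are hypothetical stand-ins for illustration, not tokio's actual API:

```rust
// Hypothetical types illustrating the two accessor styles under discussion;
// not tokio's real metrics API.
struct WorkerStats {
    steal_count: u64,
}

struct RuntimeStats {
    workers: Vec<WorkerStats>,
}

impl RuntimeStats {
    // Option A: return a slice. Simple and indexable, but commits the
    // runtime to storing per-worker stats contiguously.
    fn workers_slice(&self) -> &[WorkerStats] {
        &self.workers
    }

    // Option B: return an iterator. Keeps the storage layout private, so
    // the runtime could later compute per-worker stats lazily.
    fn workers_iter(&self) -> impl Iterator<Item = &WorkerStats> {
        self.workers.iter()
    }
}

fn main() {
    let stats = RuntimeStats {
        workers: vec![
            WorkerStats { steal_count: 3 },
            WorkerStats { steal_count: 5 },
        ],
    };
    let total_a: u64 = stats.workers_slice().iter().map(|w| w.steal_count).sum();
    let total_b: u64 = stats.workers_iter().map(|w| w.steal_count).sum();
    println!("{} {}", total_a, total_b); // both sum to 8
}
```

Callers can do the same aggregation either way; the difference is only in how much of the internal representation the API promises.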
Additional counters
- Duration between last two polls.
- There are open questions related to how this should be used.
- Worker queue depth
- It is unclear how this should be tracked.
@LucioFranco @Matthias247 any opportunities to try using the metrics?
I was on vacation and after that was mostly blocked on other things. But I might be able to try this out this week.
I will say upfront, however, that I expect mostly to report back on the general accessor APIs and how integration into an application will look. I think that putting poll_count/steal_count/etc. on a dashboard will not be super useful for most people, because the numbers by themselves have no significant meaning. They don't necessarily indicate that something is right or wrong. The not-yet-implemented timing metrics are more interesting, because they would indicate issues with code in tasks blocking for too long. I will nevertheless check and see how the other metrics look.
We should add a counter tracking the number of "false-positive" runtime wakeups. This would be incremented when a worker wakes up without having any work to do.
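One way such a counter could be maintained is sketched below. The function and counter names are hypothetical, and tokio's actual worker run loop is of course far more involved; this only shows where the increment would happen:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical counter of "false-positive" wakeups; not tokio's internals.
static NOOP_WAKEUP_COUNT: AtomicU64 = AtomicU64::new(0);

// Imagined hook called from a worker's run loop after it is unparked,
// once the worker has checked all of its work sources.
fn record_wakeup(found_work: bool) {
    if !found_work {
        // The worker woke up but neither its local queue, the injection
        // queue, nor stealing produced a task: a false-positive wakeup.
        NOOP_WAKEUP_COUNT.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    record_wakeup(true);
    record_wakeup(false);
    record_wakeup(false);
    println!("{}", NOOP_WAKEUP_COUNT.load(Ordering::Relaxed)); // prints 2
}
```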
> The not-yet-implemented timing metrics are more interesting, because they would indicate issues with code in tasks blocking too long.
I second this. In our project built on tokio, we implemented a custom macro to track poll times. It would be very useful to have it in tokio.
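For reference, the core of such a poll-time tracker is a wrapper future that times each `poll` call. This is a minimal std-only sketch (the `TimedFuture` type and the no-op waker are illustrative, not the macro from the project mentioned above or any tokio API); a real implementation would feed the durations into a histogram:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};
use std::time::{Duration, Instant};

// Hypothetical wrapper that records how long each `poll` call takes.
struct TimedFuture<F> {
    inner: F,
    last_poll: Option<Duration>,
}

impl<F: Future + Unpin> Future for TimedFuture<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let start = Instant::now();
        let result = Pin::new(&mut self.inner).poll(cx);
        // Record the duration of this single poll. Long poll times here
        // would point at task code blocking the executor.
        self.last_poll = Some(start.elapsed());
        result
    }
}

// Minimal no-op waker so the example can poll without a real executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let mut fut = TimedFuture { inner: std::future::ready(42), last_poll: None };
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    assert_eq!(Pin::new(&mut fut).poll(&mut cx), Poll::Ready(42));
    println!("poll took {:?}", fut.last_poll.unwrap());
}
```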
It has been quite a long time; what's the status of this issue? @carllerche
It is quite useful for library authors to know the number of workers in the current Runtime.
Work has stalled a bit on this, but we would be happy to mentor people on this work.
Hello @LucioFranco , is this feature ready for publish?
I don't believe this is ready for stabilization.
What's the status on this? It looks like nothing has happened for a while here.
@LucioFranco, I would be willing to pick up where you left off here, I just need to know what needs to be done.
I believe we have a bunch of metrics in the runtime now and tokio-metrics has an initial version published. @carllerche would know the rest. Go ahead and pick it up, I won't have time for a bit. Happy to be tagged for reviews tho.
I aim to stabilize this feature by EOY.
As I look at the current counters, I think some of the more "scheduler implementation detail" counters should be switched to be internal counters. For example, the following counters are more useful for debugging the scheduler than debugging application performance issues:
- steal_count
- steal_operations
- overflow_count
At the risk of reopening bikeshedding, we must also decide on a name: metrics, stats, or counters.
Given that we have more than just counters, that probably rules out counters as a name.
Let's see what ChatGPT suggests :laughing: :
In the context of Tokio's scheduler or any similar system, the terms "metrics" and "statistics" can overlap significantly. However, based on the definitions provided, the examples you've listed—how many tasks were scheduled, how many times the worker went to sleep, a histogram of how long it took to poll each task—could be considered both metrics and statistics. Here's why:
As Statistics: These measures describe certain aspects of your data or system operation. They give you quantitative summaries of the system performance and behaviors, which is in line with the definition of statistics.
As Metrics: These numbers can also be viewed as metrics, especially if you're using them to track and assess the performance of your system, identify bottlenecks, or measure against a particular performance target or benchmark.
So, these terms can be interchangeable in many cases. The term you choose to use can depend on your specific context or preference, or how you intend to use the data. If you're using these numbers primarily to understand system behavior or describe what's happening, "statistics" might be the more appropriate term. If you're using them to evaluate system performance or track against goals, "metrics" might be more suitable.
In general, in software systems and libraries like Tokio, we often talk about "performance metrics" or just "metrics" as it implies ongoing tracking and often is used for making decisions about system improvements or changes.
Artificial Indecisiveness
Hi! Sorry, is it too late to ask to turn all gauges into counter pairs?
Metrics like active_tasks_count or injection_queue_depth are fast-moving gauges, and even taking a snapshot every few seconds doesn't say much about what's going on inside Tokio. It would be better to use two counters: one for additions and one for removals. During snapshotting, one can calculate the rate at which items went into a queue (or tasks got spawned) over the snapshot interval, and the current queue depth or number of active tasks, if needed, is simply the delta between the two counters. That is much more usable for monitoring.
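The counter-pair pattern being proposed can be sketched as follows. The `QueueCounters` type and method names are hypothetical, used only to illustrate the idea of deriving a gauge from two monotonic counters:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the counter-pair pattern: instead of exporting a
// `queue_depth` gauge, export two monotonically increasing counters.
struct QueueCounters {
    pushed: AtomicU64,
    popped: AtomicU64,
}

impl QueueCounters {
    const fn new() -> Self {
        Self { pushed: AtomicU64::new(0), popped: AtomicU64::new(0) }
    }

    fn on_push(&self) {
        self.pushed.fetch_add(1, Ordering::Relaxed);
    }

    fn on_pop(&self) {
        self.popped.fetch_add(1, Ordering::Relaxed);
    }

    // The current depth is the delta between the counters. saturating_sub
    // guards against a pop landing between the two relaxed loads.
    fn depth(&self) -> u64 {
        self.pushed
            .load(Ordering::Relaxed)
            .saturating_sub(self.popped.load(Ordering::Relaxed))
    }
}

fn main() {
    let c = QueueCounters::new();
    for _ in 0..5 {
        c.on_push();
    }
    for _ in 0..2 {
        c.on_pop();
    }
    // A monitoring system scraping `pushed` and `popped` can compute both
    // the throughput (difference between scrapes) and the depth (delta).
    println!("depth = {}", c.depth()); // depth = 3
}
```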