seastar Clarify reactor utilization metrics

The key CPU use metric for seastar is "reactor utilization". This is the time spent on CPU [1], excluding time spent busy polling (i.e., when there is no additional work to do but we continue polling for IO because "why not" and "that's how we set ourselves up to get disk completions").

However, we simply call this "CPU utilization" or "CPU busy time" in the metrics, which is confusing, since it is very different from actual OS-reported CPU utilization because it excludes polling time. This is a frequent source of confusion when people expect this to act like CPU utilization or use this interchangeably with CPU utilization reported by the OS (e.g., via node_exporter).

This change clarifies the description, calling it "Reactor non-polling CPU utilization". Furthermore, it also clarifies the specific definition of the utilization (gauge) variation of the metric: it is a five second rolling average.
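To make the "five second rolling average" concrete, here is a minimal sketch of how a gauge like this could be derived from a cumulative busy-time counter. It is not Seastar's implementation: the class `rolling_utilization`, its members, and the sampling scheme are all hypothetical and only illustrate the definition above.

```cpp
#include <cassert>
#include <chrono>
#include <deque>

using std::chrono::milliseconds;

// Hypothetical helper (not Seastar code): turns samples of a cumulative
// "busy, excluding idle polling" counter into a rolling utilization gauge.
class rolling_utilization {
public:
    explicit rolling_utilization(milliseconds window = milliseconds(5000))
        : _window(window) {}

    // `now` is wall time since start; `busy_total` is the cumulative
    // reactor busy time (idle polling already excluded) at that instant.
    void sample(milliseconds now, milliseconds busy_total) {
        assert(_samples.empty() || now >= _samples.back().now);
        _samples.push_back({now, busy_total});
        // Keep the newest sample that is at least one window old as the
        // baseline and drop everything older than that.
        while (_samples.size() > 2 && now - _samples[1].now >= _window) {
            _samples.pop_front();
        }
    }

    // Busy time as a percentage of wall time over the retained window.
    double percent() const {
        if (_samples.size() < 2) {
            return 0.0;
        }
        auto wall = _samples.back().now - _samples.front().now;
        auto busy = _samples.back().busy_total - _samples.front().busy_total;
        return wall.count() > 0 ? 100.0 * busy.count() / wall.count() : 0.0;
    }

private:
    struct point {
        milliseconds now;
        milliseconds busy_total;
    };
    milliseconds _window;
    std::deque<point> _samples;
};
```

Note that the denominator is wall time, so a reactor that spends the whole window idle polling reports 0% here even though the OS would report the core as fully used.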

Jan 28 '24 16:01 travisdowns

[1] Even calling it CPU utilization isn't strictly correct as it's a wall-time measurement, not a CPU time measurement (e.g., via rusage), so steal time will inflate reactor utilization, unlike how it works for competing processes and CPU utilization.
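A small sketch of the wall-time versus CPU-time distinction, assuming a Linux/POSIX system. The `measure` helper and the `elapsed` struct are made up for illustration. Steal time cannot be reproduced on demand, so any period in which the thread is off the CPU (a sleep, for example) stands in for it: wall time advances while thread CPU time does not.

```cpp
#include <cassert>
#include <chrono>
#include <time.h>

// Hypothetical result type: the same interval measured two ways.
struct elapsed {
    double wall_ms; // steady (wall) clock, what a wall-time metric sees
    double cpu_ms;  // CPU time actually consumed by this thread
};

// Per-thread CPU time via POSIX clock_gettime.
inline double thread_cpu_ms() {
    timespec ts{};
    int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    assert(rc == 0);
    (void)rc;
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

// Run `fn` and report both wall time and thread CPU time for it.
template <typename Fn>
elapsed measure(Fn&& fn) {
    auto wall_start = std::chrono::steady_clock::now();
    double cpu_start = thread_cpu_ms();
    fn();
    double cpu_end = thread_cpu_ms();
    auto wall_end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> wall = wall_end - wall_start;
    return {wall.count(), cpu_end - cpu_start};
}
```

If the measured function is descheduled for most of the interval, `wall_ms` is much larger than `cpu_ms`; a utilization figure built on wall time attributes that gap to the reactor.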

Jan 28 '24 16:01 travisdowns

I suppose "non-polling" is also a bit imprecise: it should really be "non-busy-polling", as we do include in reactor utilization the polls we do when not idle, i.e., while there is still more work to do afterwards.
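One way to picture the "non-busy-polling" distinction is as per-pass accounting in the reactor loop. The sketch below is hypothetical (the struct, enum, and member names are invented, and the separate sleep bucket is an assumption); it only shows that a poll made while there is work counts as busy, and only polling with nothing to do is excluded.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::microseconds;

// Hypothetical accounting, not Seastar's reactor code.
struct reactor_accounting {
    // What a single pass through the loop turned out to be.
    enum class pass {
        did_work,  // ran tasks/timers or a poll that happened alongside work
        idle_poll, // polled with nothing to do (busy polling)
        slept      // assumption: the thread gave up the CPU entirely
    };

    microseconds busy_time{0};
    microseconds idle_poll_time{0};
    microseconds sleep_time{0};

    // Attribute the wall time of one pass to exactly one bucket.
    void account(pass kind, microseconds elapsed) {
        assert(elapsed.count() >= 0);
        switch (kind) {
        case pass::did_work:  busy_time += elapsed;      break;
        case pass::idle_poll: idle_poll_time += elapsed; break;
        case pass::slept:     sleep_time += elapsed;     break;
        }
    }

    // Everything the loop has accounted for so far.
    microseconds total() const {
        return busy_time + idle_poll_time + sleep_time;
    }
};
```

Under this model the metric discussed here corresponds to `busy_time`, while OS-reported CPU use would be closer to `busy_time + idle_poll_time`.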

Consider the beginning of a discussion.

Jan 28 '24 16:01 travisdowns

A suggestion -- the proposed clarification is opt-out-ish, it says that "this is CPU usage excluding what reactor does", but it could as well be opt-in-ish by saying something like "this is CPU usage by tasks and timers"

Feb 02 '24 13:02 xemul

A suggestion -- the proposed clarification is opt-out-ish, it says that "this is CPU usage excluding what reactor does", but it could as well be opt-in-ish by saying something like "this is CPU usage by tasks and timers"

In general I agree with that approach to phrasing things, but the difficulty is that it really is defined as runtime - polling time. If I use the other approach, I need to list all the useful things that the reactor could do: tasks, timers, flushing sockets, reading from sockets, some types of polling for IO (i.e., when results are coming back).

Anyway, here's an attempt:

Total reactor CPU time in active processing. Active processing includes all work, such as running tasks, executing timers, submitting and reaping IO, but excludes idle polling and sleep.

WDYT?

Feb 02 '24 13:02 travisdowns

Yes, you're right, opt-in it looks terrible. How about this then:

cpy_busy_time: Total cpu busy time in milliseconds excluding idle-polling
utilization: Average CPU busy time for the last 5-seconds

?

Feb 02 '24 14:02 xemul

@xemul :

How about this then:

cpy_busy_time: Total cpu busy time in milliseconds excluding idle-polling
utilization: Average CPU busy time for the last 5-seconds

I really wanted to have "reactor" in there as it was one of the clarifications I was trying to make: that this is time measured in the reactor, not just in seastar generally: e.g., it does not include time that the syscall thread or any alien threads are running. I think the description needs to stand alone since the metric name/group is usually not visible, so it's not obvious from the description that we are talking about the reactor.

So WDYT about:

cpy_busy_time: Total reactor CPU busy time in milliseconds excluding time spent idle-polling.
utilization: 5-second average of reactor CPU busy time, excluding time spent idle-polling, in percent.

Feb 02 '24 16:02 travisdowns

So WDYT about:

cpy_busy_time: Total reactor CPU busy time in milliseconds excluding time spent idle-polling.
utilization: 5-second average of reactor CPU busy time, excluding time spent idle-polling, in percent.

:+1: let's go this way

Feb 05 '24 13:02 xemul
