Introduce an optional normalized total CPU utilization metric

Open rogercoll opened this issue 10 months ago • 11 comments

Propose new conventions

Knowing what "cpu utilization" means isn’t trivial. There can be different interpretations of it and some users would want control over whether iowait or steal states count towards the usage or not. But there should be a simple, opinionated, and standardized metric that doesn’t need custom processors, or advanced knowledge of PromQL and cpu state philosophy to be useful for most use cases.

As part of the standardization, we need to align on which CPU states count towards "cpu utilization". A lot of examples recommend something like 1-idle, while others (e.g. https://github.com/prometheus/node_exporter/pull/2194) recommend 1-(idle|iowait|steal). The system.cpu.total.norm.pct metric of the Elastic System integration is defined as 1-(idle|iowait).
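To make the difference between these definitions concrete, here is a minimal sketch in Python. The per-state CPU-time deltas are made-up numbers and the `utilization` helper is hypothetical; it only illustrates how the choice of "non-busy" states changes the result for the same sample.

```python
# Hypothetical CPU-time deltas (seconds) per state over one sampling window,
# already summed across all logical CPUs. Values are invented for illustration.
deltas = {"user": 30.0, "system": 10.0, "idle": 50.0, "iowait": 6.0, "steal": 4.0}


def utilization(deltas, non_busy_states):
    """Return the fraction of total CPU time not spent in the given states."""
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no time elapsed in the window, nothing to report
    non_busy = sum(deltas.get(state, 0.0) for state in non_busy_states)
    return 1.0 - non_busy / total


one_minus_idle = utilization(deltas, {"idle"})                       # 1-idle
idle_iowait = utilization(deltas, {"idle", "iowait"})                # 1-(idle|iowait)
idle_iowait_steal = utilization(deltas, {"idle", "iowait", "steal"})  # 1-(idle|iowait|steal)
```

With this sample the three definitions give roughly 0.50, 0.44, and 0.40, so the same host can look 10 percentage points busier or idler depending on which convention is picked.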

While it is possible to create a custom processor that calculates another derived metric, or to use a complex PromQL query, standardizing the semantics for an overall cpu utilization would be a big win in terms of usability for such a core metric. We don’t want to open the floodgates on adding derived/convenience metrics for every possible use case, but CPU utilization is a fundamental metric that almost everyone needs, whose semantics aren’t obvious, and that requires complex queries at the moment.

Also, replicating these complex PromQL queries isn’t feasible for all backends. So from a perspective of being more vendor-agnostic, adding a standardized metric that is easily aggregatable makes sense, too.

Similar to system.cpu.utilization, this should be opt-in. In contrast to system.cpu.utilization, it should not have attributes for cpu.mode or system.cpu.logical_number.

To introduce an optional normalized total CPU utilization metric, we propose the following solutions:

Proposed solution 1 (new cpu namespace)

After discussing this topic with the System SIG working group (1/30/2025), it seems that the current system.cpu.* metrics do not reflect CPU time/usage over the whole system, but rather relative to a single logical CPU. Each reported metric is specific to a CPU (system.cpu.logical_number), so we propose moving them to a new CPU namespace that defines metrics relative to a single logical CPU. They are also inconsistent with the container CPU metrics (time/usage), as the latter are reported globally, not per logical CPU.

After decoupling the CPU metrics from the system namespace, we would add a new opinionated metric that reflects the overall CPU usage of a given system. We propose reusing the current system.cpu.utilization name for it; it would be computed from the CPU metrics as 1 - (idle + iowait + steal), aggregated over all logical CPUs available in the system. This metric would be a simple, opinionated calculation that provides an aggregated view of the CPU utilization of the whole system without requiring advanced PromQL queries or custom processors.
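A minimal sketch of that calculation, in Python, assuming per-CPU, per-mode time deltas are available for one sampling window. The data layout, the sample values, and the `total_utilization` function name are hypothetical and not part of any existing receiver.

```python
# Hypothetical per-logical-CPU time deltas (seconds) per mode for one window.
per_cpu_deltas = {
    0: {"user": 0.8, "system": 0.1, "idle": 0.1, "iowait": 0.0, "steal": 0.0},
    1: {"user": 0.1, "system": 0.1, "idle": 0.6, "iowait": 0.1, "steal": 0.1},
}

# Modes that do NOT count towards utilization under the proposed definition.
NON_BUSY_MODES = ("idle", "iowait", "steal")


def total_utilization(per_cpu_deltas):
    """Compute 1 - (idle + iowait + steal), aggregated over all logical CPUs."""
    total = 0.0
    non_busy = 0.0
    for modes in per_cpu_deltas.values():
        total += sum(modes.values())
        non_busy += sum(modes.get(mode, 0.0) for mode in NON_BUSY_MODES)
    if total <= 0:
        return 0.0  # empty window or no CPUs reported
    return 1.0 - non_busy / total


system_utilization = total_utilization(per_cpu_deltas)
```

For this sample the result is about 0.55: one CPU is 90% busy, the other 20% busy, and the aggregate carries no cpu.mode or system.cpu.logical_number attribute.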

Proposed solution 2 (rename)

The idea would be to reposition the current system.cpu.utilization metric to system.cpu.usage:

  • The current system.cpu.utilization metric will be renamed to system.cpu.usage. Its current definition is unclear, and it does not represent a fractional usage over a limit; instead, it is averaged over the sampled window: https://github.com/open-telemetry/semantic-conventions/issues/647#issuecomment-1898949106
  • In fact, the current metric’s description implies that each CPU usage value is system-wide because it is divided by the number of logical CPUs. However, the actual implementation in the opentelemetry-collector does not perform this division; it just includes the associated CPU logical number as an attribute. This is inconsistent with other metrics like “process.cpu.utilization”, which does divide by the number of logical CPUs: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/c4bee406f5db233a2bc7f0c69185764fea9fb033/receiver/hostmetricsreceiver/internal/scraper/processscraper/ucal/cpu_utilization_calculator.go#L60
  • This metric will continue to provide per-mode CPU utilization values. It will retain its existing attributes (cpu.mode and system.cpu.logical_number). Already proposed in https://github.com/open-telemetry/semantic-conventions/issues/1130#issuecomment-2257632204
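The mismatch between the description and the implementation can be illustrated with a small Python sketch. Both functions and the sample values are hypothetical; they only contrast a per-CPU value that is left undivided with one that is divided by the number of logical CPUs.

```python
def per_cpu_fraction(busy_delta_s, elapsed_s):
    """Busy fraction of a single logical CPU (no division by CPU count)."""
    return busy_delta_s / elapsed_s


def system_wide_fraction(busy_delta_s, elapsed_s, logical_cpus):
    """Busy time of one CPU expressed as a fraction of whole-system capacity."""
    return busy_delta_s / (elapsed_s * logical_cpus)


# Hypothetical busy-time deltas (seconds) for 4 logical CPUs over a 1 s window.
busy = [0.5, 0.5, 1.0, 0.0]
elapsed = 1.0

undivided = [per_cpu_fraction(b, elapsed) for b in busy]
divided = [system_wide_fraction(b, elapsed, len(busy)) for b in busy]

sum_undivided = sum(undivided)  # can grow up to the number of logical CPUs
sum_divided = sum(divided)      # stays within [0, 1] for the whole system
```

Summing the undivided values here gives 2.0, while summing the divided values gives 0.5, which is why the two interpretations cannot be aggregated the same way.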

Same as proposal 1, add a new system.cpu.utilization metric to reflect the overall system CPU usage.

Other

This is not ideal, as it looks like the current system metrics are inconsistent with their definitions, but we could clarify them and introduce a new metric that does not require any renaming (e.g., system.cpu.total.utilization?).

rogercoll · Feb 05 '25 09:02