Guidelines on opt-in/recommended status of the `*.usage/*.limit/*.utilization/*.time` metrics

Open dmitryax opened this issue 7 months ago • 3 comments

What's missing?

The Instrument Naming section defines the *.usage, *.limit, *.utilization, and *.time metrics, but it does not specify their requirement levels (Recommended vs. Opt-In). Because these metrics convey overlapping information in different forms, implementations may become inconsistent without explicit guidance.

Describe the solution you'd like

The OTel Collector has a similar concept of default and optional metrics. There we follow the rule that basic metrics are enabled by default, while derived metrics in the set are optional. For example, for CPU, *.cpu.time is the basic metric (enabled by default), whereas *.cpu.utilization and *.cpu.usage are optional. For memory, *.memory.usage is the default metric, and *.memory.utilization is optional.
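A minimal sketch of the basic-versus-derived relationship described above: usage and utilization can both be computed from two samples of a cumulative CPU time counter. The function names, the sample values, and the use of a core-count limit as the denominator are assumptions for illustration, not something defined by the conventions or the Collector.

```python
def cpu_usage_cores(time_prev_s, time_curr_s, elapsed_s):
    """Average number of busy cores over the interval, from cumulative CPU seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (time_curr_s - time_prev_s) / elapsed_s


def cpu_utilization(time_prev_s, time_curr_s, elapsed_s, limit_cores):
    """Usage as a fraction of a limit; the choice of limit is an assumption."""
    if limit_cores <= 0:
        raise ValueError("limit_cores must be positive")
    return cpu_usage_cores(time_prev_s, time_curr_s, elapsed_s) / limit_cores


# Hypothetical samples: the counter grew from 120 s to 150 s of CPU time
# over a 60 s wall-clock interval on a 4-core host.
usage = cpu_usage_cores(120.0, 150.0, 60.0)            # 0.5 cores
utilization = cpu_utilization(120.0, 150.0, 60.0, 4)   # 0.125 (12.5%)
```

Both derived values need an interval and, for utilization, a denominator, while the counter itself needs neither.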

We could designate the *.utilization metrics as Recommended because they are more UI/UX-friendly, and then adjust the Collector defaults accordingly. However, it's harder to apply spatial aggregation to the utilization metrics, whereas base metrics such as *.cpu.time or *.usage can be simply summed.
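To make the spatial aggregation point concrete, here is a small sketch with two hypothetical hosts. The dictionary keys and values are made up for illustration. Usage sums directly, whereas averaging per-host utilization gives a different answer from the fleet-wide ratio unless each host is weighted by its limit.

```python
# Hypothetical per-host readings.
hosts = [
    {"usage_cores": 1.0, "limit_cores": 2.0},  # 50% utilized
    {"usage_cores": 6.0, "limit_cores": 8.0},  # 75% utilized
]

# Usage-style metrics aggregate across hosts with a plain sum.
total_usage = sum(h["usage_cores"] for h in hosts)

# A naive mean of per-host utilization ignores host size.
naive_mean = sum(h["usage_cores"] / h["limit_cores"] for h in hosts) / len(hosts)

# The fleet-wide utilization needs the limits as well.
fleet_utilization = total_usage / sum(h["limit_cores"] for h in hosts)

print(total_usage, naive_mean, fleet_utilization)  # 7.0 0.625 0.7
```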

We should probably introduce generic guidelines stating that only one metric in a group that conveys the same information can be Recommended. And, for this set of metrics, we can specify which are Recommended and which are Opt-In.

dmitryax avatar Apr 24 '25 05:04 dmitryax

@open-telemetry/semconv-system-approvers should this be covered before any metrics move to GA?

ChrsMark avatar Apr 30 '25 13:04 ChrsMark

We discussed this issue at the System Semantic Conventions meeting on May 8, 2025, focusing on whether usage-style or utilization-style metrics should be the Recommended option. To provide some data, I checked GCP's recommended Alert Templates, since a wide swath of GCP users end up using these to (presumably) good effect.

GCE VMs

In the Alert Templates for GCE VMs, CPU and Disk Alerts are based on utilization-style metrics.

GKE Auto-Pilot Cluster

These Alert Templates recommend utilization-style metrics based on the CPU/Memory Limits for the cluster.

I don't think these datapoints should directly sway us one way or the other; these are recommended Alert Templates that show up directly in the UI for anyone who clicks on the Observability tab of their resources, so they are by necessity covering lowest-common-denominator use cases. I figured I'd provide them as datapoints in case it helps the discussion along.

braydonk avatar May 09 '25 13:05 braydonk

+1 for CPU utilization (similar to the one proposed in #2088 that only aggregates across "active" cpu modes) being the most common CPU metric to alert on

trask avatar May 09 '25 19:05 trask

> The OTel Collector has a similar concept of default and optional metrics. There we follow the rule that basic metrics are enabled by default, while derived metrics in the set are optional. For example, for CPU, *.cpu.time is the basic metric (enabled by default), whereas *.cpu.utilization and *.cpu.usage are optional. For memory, *.memory.usage is the default metric, and *.memory.utilization is optional.
>
> We could designate the *.utilization metrics as Recommended because they are more UI/UX-friendly, and then adjust the Collector defaults accordingly. However, it's harder to apply spatial aggregation to the utilization metrics, whereas base metrics such as *.cpu.time or *.usage can be simply summed.
>
> We should probably introduce generic guidelines stating that only one metric in a group that conveys the same information can be Recommended. And, for this set of metrics, we can specify which are Recommended and which are Opt-In.

It seems that it's not trivial to standardize on *.cpu.usage and *.cpu.utilization, especially across domains like system, containers, and k8s. We've already hit this sort of long-running conversation at https://github.com/open-telemetry/semantic-conventions/issues/1873.

Utilization for k8s can be really ambiguous because there are several candidate denominators: container/pod limits, the node's capacity, the node's allocatable CPU, etc.
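A short sketch of that ambiguity, using made-up numbers: the same usage reading yields a different utilization for each denominator. The key names below are hypothetical labels, not metric or attribute names from the conventions.

```python
# One hypothetical usage reading for a container, in cores.
usage_cores = 0.5

# Candidate denominators (all values are assumptions for illustration).
denominators = {
    "container_limit": 1.0,
    "node_allocatable": 3.5,
    "node_capacity": 4.0,
}

# Same numerator, three different "utilization" values.
utilizations = {name: usage_cores / cores for name, cores in denominators.items()}

for name, value in utilizations.items():
    print(f"{name}: {value:.1%}")
```

A metric named only `utilization` does not tell the reader which of these was used.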

cpu.usage in k8s is derived directly from the Kubelet's stats API and is calculated in an opinionated way: it is measured in CPU core-nanoseconds per second, with the sample window fixed at 10 seconds. See https://github.com/kubernetes/kubernetes/blob/ebd25a55e32955b03f29f16d2128c05d9d625745/pkg/kubelet/stats/cri_stats_provider.go#L123 (@dashpole check me on this)

Docker also takes a different approach to calculating CPU utilization, as described in the API docs and in the CLI implementation.
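For comparison, here is a sketch of the calculation as the Docker Engine API documentation for container stats is commonly read: the container's CPU delta divided by the system CPU delta, scaled by the number of CPUs. The field names follow the stats response as best understood here, and the sample payload is invented, so treat both as assumptions rather than a reference implementation.

```python
def docker_cpu_percent(stats):
    """CPU percentage from one Docker stats sample (current and previous readings)."""
    cpu = stats["cpu_stats"]
    pre = stats["precpu_stats"]

    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    # Fall back to the per-CPU list when online_cpus is absent.
    num_cpus = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage", []))

    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0
    return cpu_delta / system_delta * num_cpus * 100.0


# Invented sample: the container used 0.5 s of CPU while the host
# accumulated 2 s of CPU time across 2 online CPUs.
sample = {
    "cpu_stats": {
        "cpu_usage": {"total_usage": 1_500_000_000},
        "system_cpu_usage": 12_000_000_000,
        "online_cpus": 2,
    },
    "precpu_stats": {
        "cpu_usage": {"total_usage": 1_000_000_000},
        "system_cpu_usage": 10_000_000_000,
    },
}

percent = docker_cpu_percent(sample)  # 50.0
```

Note that the result can exceed 100% on multi-CPU hosts, which is yet another convention compared with a 0 to 1 utilization.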

Given this, since there is no straightforward way to standardize *.cpu.usage and *.cpu.utilization, I lean towards having only the unambiguous *.cpu.time as Recommended and all the rest as Opt-In.

ChrsMark avatar Sep 18 '25 13:09 ChrsMark

> cpu.usage in k8s is derived directly from the Kubelet's stats API and is calculated in an opinionated way: it is measured in CPU core-nanoseconds per second, with the sample window fixed at 10 seconds. See https://github.com/kubernetes/kubernetes/blob/ebd25a55e32955b03f29f16d2128c05d9d625745/pkg/kubelet/stats/cri_stats_provider.go#L123 (@dashpole check me on this)

We (the K8s Node SIG) generally consider the addition of the windowed CPU usage rate metric to be a mistake. It was added for convenience and ended up causing a lot of user confusion. We recommend using the cumulative CPU usage metric instead and try to steer people away from the windowed version as much as possible.
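A rough sketch of why a pre-windowed rate can confuse: with a cumulative counter the consumer picks the window, and different windows over the same data give different rates. The samples and the helper name are hypothetical.

```python
# Hypothetical (timestamp_s, cumulative_cpu_seconds) samples.
samples = [(0, 0.0), (10, 9.0), (20, 10.0)]


def rate_cores(samples, start_idx, end_idx):
    """Average cores used between two samples of a cumulative counter."""
    (t0, c0), (t1, c1) = samples[start_idx], samples[end_idx]
    return (c1 - c0) / (t1 - t0)


first_10s = rate_cores(samples, 0, 1)  # 0.9 cores during the burst
last_10s = rate_cores(samples, 1, 2)   # 0.1 cores afterwards
whole_20s = rate_cores(samples, 0, 2)  # 0.5 cores on average
```

A metric that reports only the most recent fixed window would show 0.1 here, and the earlier burst would be invisible unless it was scraped at the right moment.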

Edit: Your statement is correct.

dashpole avatar Sep 18 '25 13:09 dashpole