Misleading recording rule for cluster_namespace_deployment:container_cpu_usage_seconds_total:sum_rate

Open marybelvargas opened this issue 1 year ago • 0 comments

Describe the bug

The current rule does not filter by container or image. For anyone who did not drop the aggregate ("total") series at scrape time, the per-container series and the aggregate series are summed together, so the result is double the actual usage.

The corresponding memory rule does have the filter {image!=""}, which is the more common convention, so its result is more accurate.

Since both CPU and memory come from the same job (cadvisor), the recording rules should be consistent: either apply the filter in both calculations, or document that users are expected to handle this at scrape time.
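
A quick way to check whether a given setup is affected might look like the query below. It is only a sketch and assumes the aggregate series are the ones with an empty image label, which is what the memory rule's {image!=""} filter implies; the labels may differ depending on the container runtime and cAdvisor version.

```promql
# Assumption: aggregate (non-container) cAdvisor series carry image="".
# A non-empty result means those series were not dropped at scrape time
# and are being added on top of the per-container series.
count by (cluster, namespace, pod) (
  container_cpu_usage_seconds_total{image=""}
)
```

If this returns nothing, the aggregate series are already being dropped and the current rule should not double count.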

Current definition is:

sum by (cluster, namespace, deployment) (
  label_replace(
    label_replace(
      sum by (cluster, namespace, pod)(rate(container_cpu_usage_seconds_total[1m])),
      "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
    ),
    # The question mark in "(.*?)" is used to make it non-greedy, otherwise it
    # always matches everything and the (optional) zone is not removed.
    "deployment", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"
  )
)
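
For comparison, this is roughly what the CPU rule could look like with the same filter the memory rule uses. It is a sketch of the proposal, not a tested change: the only difference from the current definition is the added {image!=""} matcher. On setups where the image label is not populated, a matcher such as {container!=""} may be needed instead.

```promql
sum by (cluster, namespace, deployment) (
  label_replace(
    label_replace(
      # Proposed change: exclude series with an empty image label,
      # mirroring the memory rule.
      sum by (cluster, namespace, pod)(rate(container_cpu_usage_seconds_total{image!=""}[1m])),
      "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
    ),
    # The question mark in "(.*?)" is used to make it non-greedy, otherwise it
    # always matches everything and the (optional) zone is not removed.
    "deployment", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"
  )
)
```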

Aug 14 '24 18:08 marybelvargas