kube-prometheus

Prometheus recording rules not working with containerd runtime

Open ArchiFleKs opened this issue 3 years ago • 4 comments

What happened?

I deployed prometheus-operator as usual on a new Kubernetes cluster. I normally run it on EKS. The only difference with this new cluster is that containerd is used as the container runtime. The following recording rule fails:

        - expr: |
            sum by (cluster, namespace, pod, container) (
              irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
            ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
              1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
            )
          record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate

In particular, `container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}` returns metrics when using Docker but not when using containerd. The same selector without the image matcher, `container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor"}`, does seem to work on containerd.

I have tried both queries on both clusters.

Did you expect to see something different?

I expected to see CPU usage in Grafana

How to reproduce it (as minimally and precisely as possible):

Deploy kube-prometheus-stack on EKS with containerd

Environment

  • Prometheus Operator version: 0.50

  • Kubernetes version information: v1.21

  • Kubernetes cluster kind: EKS with EKS ami using containerd as container runtime

https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/kubernetes-prometheusRule.yaml#L1161

Anything else we need to know?: I'm not 100% sure that containerd is the issue, but based on the recording rule's query:

  • when using Docker: there is an image label
  • when using containerd: there is no image label, so the query returns nothing
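The observation above is consistent with how PromQL label matchers treat a missing label: a series without an `image` label is handled as if `image` were the empty string, so `image!=""` filters it out. A minimal Python sketch of that matching rule follows. It is a simplified model, not Prometheus code, and the series labels are hypothetical examples.

```python
# Simplified model of a PromQL "not equal" matcher.
# Assumption being illustrated: a label that is absent from a series
# is treated as if its value were the empty string.

def matches_not_equal(labels: dict, name: str, value: str) -> bool:
    """Model of the PromQL matcher name!="value"."""
    return labels.get(name, "") != value


# Hypothetical series, shaped like the two cases described above.
with_image = {"namespace": "default", "pod": "web-0", "container": "app", "image": "nginx:1.21"}
without_image = {"namespace": "default", "pod": "web-0", "container": "app"}

series = [with_image, without_image]

# Only series that carry a non-empty image label survive image!="".
selected = [s for s in series if matches_not_equal(s, "image", "")]
print(len(selected))
```

If every cAdvisor series on a cluster lacks the `image` label, the selector in the recording rule matches nothing, and the rule produces no samples.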

ArchiFleKs • Sep 21 '21 16:09

What is the containerd version used in EKS? kubectl describe node ... should have this information under System Info -> Container Runtime Version section.

I cannot replicate this issue with containerd 1.4.8.
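The runtime version asked about above is also exposed in each node's `status.nodeInfo.containerRuntimeVersion` field. As a rough sketch, assuming the output of `kubectl get nodes -o json` has been captured as a string, it could be extracted like this. The helper name `runtime_versions` and the sample node are hypothetical.

```python
import json


def runtime_versions(nodes_json: str) -> dict:
    """Map node name -> container runtime version.

    Expects the JSON text produced by `kubectl get nodes -o json`.
    """
    doc = json.loads(nodes_json)
    return {
        item["metadata"]["name"]: item["status"]["nodeInfo"]["containerRuntimeVersion"]
        for item in doc.get("items", [])
    }


# Hypothetical sample with a single node, trimmed to the fields used here.
sample = json.dumps({
    "items": [
        {
            "metadata": {"name": "node-a"},
            "status": {"nodeInfo": {"containerRuntimeVersion": "containerd://1.4.8"}},
        }
    ]
})

print(runtime_versions(sample))
```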

paulfantom • Oct 18 '21 14:10

I'm facing the same issue, although I'm using AKS instead of EKS. I noticed that if I evaluate the query behind the recording rule directly, I do get results.

nicolastakashi • Jan 18 '22 21:01

I found the issue on my side: the labels that Prometheus uses to select the PrometheusRule were wrong.
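For context, the Prometheus custom resource picks up PrometheusRule objects through a label selector (`spec.ruleSelector`), so a rule whose labels do not satisfy the selector is never loaded, even though its query works when run by hand. A minimal Python sketch of `matchLabels` semantics follows. It models only `matchLabels` (not `matchExpressions`), and the label keys and values are hypothetical.

```python
def selects(match_labels: dict, object_labels: dict) -> bool:
    """Model of a Kubernetes matchLabels selector.

    Every key/value pair in the selector must be present on the object.
    """
    return all(object_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical selector, as it might appear under spec.ruleSelector.matchLabels.
rule_selector = {"role": "alert-rules", "prometheus": "k8s"}

# Hypothetical PrometheusRule labels: one matching, one with a wrong value.
good_rule_labels = {"role": "alert-rules", "prometheus": "k8s", "team": "platform"}
bad_rule_labels = {"role": "alert-rules", "prometheus": "main"}

print(selects(rule_selector, good_rule_labels))
print(selects(rule_selector, bad_rule_labels))
```

Extra labels on the object are harmless. A missing or differing value for any selector key is enough for the rule to be skipped.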

nicolastakashi • Jan 19 '22 20:01

@ArchiFleKs did you manage to work around this other than by removing the image label? Also, I didn't notice any change in the metric values returned with or without it when comparing against an on-prem cluster that's using dockerd.

rotemsh15 • Dec 06 '22 09:12