kube-prometheus
Prometheus recording rules not working with containerd runtime
What happened?
I deployed prometheus-operator as usual on a new Kubernetes cluster. I'm used to running it on EKS. The only difference with this new cluster is that containerd is used as the container runtime. The following recording rule fails:
sum by (cluster, namespace, pod, container) (
irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
)
record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
This selector in particular, `container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}`, returns metrics when using Docker but not when using containerd. Without the image matcher, `container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor"}` seems to work on containerd.
I have tried both queries on both clusters.
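One way to confirm that the series exist but lack the label is to count each case separately. This is a minimal sketch that assumes the same job and metrics_path labels as the rule above; in PromQL, the matcher image="" also matches series that have no image label at all.

```promql
# Series with a non-empty image label (what the recording rule selects).
count(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""})

# Series whose image label is empty or missing entirely.
count(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image=""})
```

If the first query returns no data while the second returns a count, cAdvisor is exporting CPU metrics without an image label, and the rule's image!="" matcher filters all of them out. Some series with an empty image label are expected on Docker as well, so the second count being non-zero is not a problem on its own.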
Did you expect to see something different?
I expected to see CPU usage in Grafana
How to reproduce it (as minimally and precisely as possible):
Deploy kube-prometheus-stack on EKS with containerd
Environment
- Prometheus Operator version: 0.50
- Kubernetes version information: v1.21
- Kubernetes cluster kind: EKS with the EKS AMI, using containerd as the container runtime
https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/kubernetes-prometheusRule.yaml#L1161
Anything else we need to know?: I'm not 100% sure that containerd is the issue, but based on the recording rule's query:
- when using Docker: there is an image label
- when using containerd: there is no image label, so the query returns nothing
What is the containerd version used in EKS? kubectl describe node ... should have this information under the System Info -> Container Runtime Version section.
I cannot replicate this issue with containerd 1.4.8.
I'm facing the same issue, but I noticed that if I evaluate the query behind the recording rule directly, I do see results. I'm using AKS instead of EKS, though.
I've found the issue on my side: the labels that Prometheus uses to select the PrometheusRule file were wrong.
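A rule that was never loaded can be told apart from a rule that evaluates to nothing by querying the recorded series by name. This is a minimal sketch that uses only the record name from the rule in the original report:

```promql
# Returns 1 when no series with this name exists, i.e. the recording rule
# is either not loaded or is producing no output.
absent(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate)
```

If this returns 1 while the rule's underlying expression returns data when evaluated by hand, the rule is most likely not being picked up by Prometheus (for example because of mismatched selection labels, as above) rather than being filtered out by the image matcher.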
@ArchiFleKs did you manage to overcome this other than by removing the image label? Also, I didn't notice any change in the metric values returned with or without it when comparing against an on-prem cluster that's using dockerd.