kube-state-metrics
Add an option to allow adding Kubernetes Labels as Prometheus Metric Labels per metric-series
I created this issue for discussing and tracking PR #1689. Below is a transcript of the PR 👇🏻
What this PR does / why we need it:
This PR Introduce a new flag --per-metric-labels-allowlist that its syntax works similar to --metric-labels-allowlist but instead of being a filter to add labels to kube_X_labels, it is a filter for K8S' labels that will be added to each metric time series of a resource.
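For illustration, a minimal sketch of how the two flags might sit side by side in a kube-state-metrics container spec, assuming the new flag reuses the resource=[label,...] syntax of the existing --metric-labels-allowlist (the flag is only proposed in this draft PR, so its final name and syntax may differ):

```yaml
# Sketch only: --per-metric-labels-allowlist is the flag proposed in this
# draft PR; its final name and syntax may change.
containers:
  - name: kube-state-metrics
    image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.2
    args:
      # Existing flag: exposes these labels on kube_pod_labels only.
      - --metric-labels-allowlist=pods=[app.kubernetes.io/name,app.kubernetes.io/component]
      # Proposed flag: adds the matching labels to every kube_pod_* series.
      - --per-metric-labels-allowlist=pods=[app.kubernetes.io/name,alerting.xxx/severity]
```

Whether the per-series variant reuses the existing allowlist syntax exactly is part of what this PR is meant to settle.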
Motivation
The motivation for this change is best described from the point of view of a Platform Team that is responsible for the observability stack and provides it to multiple teams and/or multiple tenants.
The goal is to make it easier to define queries, alerts, and rules without the need for complex joins, lowering the barrier for smaller, less experienced teams, and relieving the Prometheus server of constantly evaluating joins for each alert rule.
Use Case 1
- A Development Team wants to create alerts for the multiple components of their applications.
- Different components have different alerts, severities, and thresholds (for example web pods, background consumers, and different kinds of jobs), and since the components live in the same namespace, filtering by namespace alone is not feasible.
- You now have to use joins with kube_X_labels to filter for the specific resources. Complex queries become even more complex, especially the ones that already contained joins.
Use Case 2
- The Platform Team defines general, default rules for every namespace.
- The Platform Team expects resources to carry a set of standard labels, something like alerting.xxx/severity and alerting.xxx/slack-channel alongside the app.kubernetes.io/name and app.kubernetes.io/component ones.
- Complex rules are defined to join with these labels and generate alerts with variables based on the labels defined by the teams. Queries become even more complex as soon as you want to join with more than one label.
Complex Queries Example
Deployment has not matched the expected number of replicas.
- Using only one label for the join:

  ( kube_deployment_spec_replicas{}
      * on (deployment) group_left(label_app_kubernetes_io_name)
        kube_deployment_labels{label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}
    !=
    kube_deployment_status_replicas_available{}
      * on (deployment) group_left(label_app_kubernetes_io_name)
        kube_deployment_labels{label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}
  )
  and
  (
    changes(kube_deployment_status_replicas_updated{}[10m])
      * on (deployment) group_left(label_app_kubernetes_io_name)
        kube_deployment_labels{label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}
    == 0
  )

- Using two labels for the join:

  ( kube_deployment_spec_replicas{}
      * on (deployment) group_left(label_app_kubernetes_io_name, label_alerting_severity)
        kube_deployment_labels{label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}
    !=
    kube_deployment_status_replicas_available{}
      * on (deployment) group_left(label_app_kubernetes_io_name, label_alerting_severity)
        kube_deployment_labels{label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}
  )
  and
  (
    changes(kube_deployment_status_replicas_updated{}[10m])
      * on (deployment) group_left(label_app_kubernetes_io_name, label_alerting_severity)
        kube_deployment_labels{label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}
    == 0
  )
The same queries, but with the labels as part of the metric series:

- Using only one label:

  ( kube_deployment_spec_replicas{label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}
    !=
    kube_deployment_status_replicas_available{label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}
  )
  and
  (
    changes(kube_deployment_status_replicas_updated{label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}[10m]) == 0
  )

- Using two labels:

  ( kube_deployment_spec_replicas{label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}
    !=
    kube_deployment_status_replicas_available{label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}
  )
  and
  (
    changes(kube_deployment_status_replicas_updated{label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}[10m]) == 0
  )
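To tie this to Use Case 2: if the team-defined labels were part of every series, the Platform Team could ship a single generic alert and reuse the team's severity label directly. A minimal sketch, assuming the label scheme from above; the rule name, threshold, and templating are illustrative and not part of the PR:

```yaml
groups:
  - name: platform-default-alerts
    rules:
      - alert: KubeDeploymentReplicasMismatch
        expr: |
          (
            kube_deployment_spec_replicas{label_alerting_severity!=""}
              !=
            kube_deployment_status_replicas_available{label_alerting_severity!=""}
          )
          and
          (
            changes(kube_deployment_status_replicas_updated{label_alerting_severity!=""}[10m]) == 0
          )
        for: 15m
        labels:
          # Reuse the severity the owning team put on the Deployment resource.
          severity: '{{ $labels.label_alerting_severity }}'
        annotations:
          summary: 'Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas.'
```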
Goal
- Make it easier for Platform and Development teams to create queries, rules, alerts, and dashboards for resources running in Kubernetes, without the need for complex joins to filter resources.
- Alleviate some of the pressure on the Prometheus servers that constantly run joins for each alert rule.
Alternatives we tried
- Run recording rules to pre-compute the joins: having a recording rule for each metric series generated by KSM is cumbersome, and it adds a dependency between the teams and the platform team for adding any recording rule. Not ideal, especially in a multi-tenant environment (a sketch of such a rule follows below).
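For comparison, the rejected alternative looks roughly like the recording rule below, repeated for every KSM metric and label combination a team wants to filter on; the group and record names are illustrative:

```yaml
groups:
  - name: ksm-label-joins
    rules:
      # Pre-compute the join so alert rules can filter on the app label
      # without repeating the group_left join everywhere.
      - record: namespace_deployment:kube_deployment_status_replicas_available:labeled
        expr: |
          kube_deployment_status_replicas_available{}
            * on (namespace, deployment) group_left(label_app_kubernetes_io_name)
          kube_deployment_labels{}
```

One such rule is needed per metric (spec replicas, available replicas, updated replicas, ...), which is exactly the maintenance and coordination burden described above.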
How does this change affect the cardinality of KSM
(increases, decreases, or does not change cardinality)
- Unless the new flag is explicitly used, this PR does not change the cardinality of the generated metrics.
- Using the flag adds new labels to each metric series of every resource that has an allowlisted label key.
- The cardinality contributed by the label values depends on how often those labels change for the same resource (every new value starts a new time series).
- For the use case behind this new feature, allowlisted labels are typically labels that rarely change. Admins should be cautious about which labels they allowlist.
Performance
- KSM already fetches each resource's labels during metric collection.
- Prometheus performance shouldn't be affected as long as allowlisted labels are not changing constantly.
Misc
Which issue(s) this PR fixes:
- Fixes #1415
Relevant:
Notes:
- ⚠️ This is a draft PR; the implementation is not final. The PR is a working POC for the pods resource and is yet to be discussed.
This seems useful
I would be okay with having something like this in KSM since it is hard to achieve the same result with scrape-time relabeling. We can have a flag like --series-labels-allowlist or --resource-labels-allowlist that will be empty by default to make sure the feature is backwards compatible. Maybe @mrueg or @dgrisonnet have some counterpoints though.
I would personally stand against that feature. The memory impact of that feature on kube-state-metrics and Prometheus will be enormous compared to the actual benefits it brings. Also, we have to keep in mind that kube-state-metrics is meant to expose metrics with a 1:1 matching with the Kubernetes API. I wouldn't expect the Pod Spec metrics to contain any information about the pod labels. Anything beyond that should be considered out of scope in my opinion.
The core problem that you are trying to solve is to make alerting simpler. This is a good idea, but I don't think injecting more data inside of timeseries is the right way to go here. Timeseries are a very expensive kind of data. The more information you put inside of them the more resources will be used to store and retrieve the data. Whereas alerts are cheap as long as the timeseries themselves have a low cardinality.
Recording rules would be my go-to way to solve the kind of problem you are having. Writing them is for sure tedious when you have a lot of components, but this can be worked around by using an intermediate language to generate the recording rules for you and your teams. Some very common ones in the ecosystem are Jsonnet and Cue. For example, kubernetes-mixin is a project where that kind of technique is used to generate complex recording/alerting rules.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hello folks @fpetkovski, @dgrisonnet, @xmcqueen, I would like to upvote this feature request and ask whether it could be implemented for pods in the nearest future; I may also dive into the suggested code.
The case I'm working on involves otel-collector scraping Prometheus metrics from KSM and exporting them to a commercial backend. The thing is that the commercial backend charges by metric cardinality, so reporting metrics with labels such as pod_name and then doing joins would cost a lot.
To avoid these excessive costs and shrink cardinality by aggregating before exporting metrics (accepting the loss of per-pod visibility), it looks very promising to have this labelling in KSM and then remove the extra labels (like pod_name, uid, etc.) in otel-collector's processor.
/lifecycle reopen
/reopen
@kostz: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
My concerns from https://github.com/kubernetes/kube-state-metrics/issues/1758#issuecomment-1192710597 still apply.
You could today remove labels via the otel-collector relabeling config, but more complex aggregation, similar to what Prometheus offers with recording rules, is not possible: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/README.md#unsupported-features. Maybe there is a way to do post-processing in OTel, but that is out of kube-state-metrics' scope.
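For reference, the label dropping mentioned here can be expressed in the collector's Prometheus receiver with standard metric_relabel_configs; a minimal sketch, where the target address and label names are assumptions:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-state-metrics
          static_configs:
            - targets: ["kube-state-metrics.kube-system.svc:8080"]
          metric_relabel_configs:
            # Drop the high-cardinality identifiers before export.
            - action: labeldrop
              regex: "pod|uid"
```

Note that dropping identifying labels without an aggregation step can produce duplicate series, which is the limitation the comment above points to.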