Add an option to allow adding Kubernetes Labels as Prometheus Metric Labels per metric-series

sherifabdlnaby opened this issue 3 years ago

I created this issue to discuss and track PR #1689. Below is a transcript of the PR 👇🏻

What this PR does / why we need it:

This PR introduces a new flag, --per-metric-labels-allowlist, whose syntax works like --metric-labels-allowlist, but instead of filtering which Kubernetes labels are added to kube_X_labels, it filters which Kubernetes labels are added to each metric time series of a resource.
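
A hypothetical manifest snippet showing how the proposed flag could sit next to the existing one (the flag name and the resource=[label,...] syntax are taken from this PR; the image tag and label choices are illustrative only):

    # kube-state-metrics container args (sketch, not final syntax)
    containers:
      - name: kube-state-metrics
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.2
        args:
          # existing flag: adds the label only to kube_deployment_labels
          - --metric-labels-allowlist=deployments=[app.kubernetes.io/name]
          # proposed flag: adds the label to every kube_deployment_* series
          - --per-metric-labels-allowlist=deployments=[app.kubernetes.io/name]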

Motivation

The motivation for this change is best described from the point of view of a platform team that is responsible for the observability stack and provides it to multiple teams and/or multiple tenants.

The goal is to make it easier to define queries, alerts, and rules without the need for complex joins, lowering the barrier for smaller, less experienced teams and alleviating pressure on the Prometheus server, which otherwise constantly evaluates joins for each alert rule.

Use Case 1

  1. A development team wants to create alerts for the multiple components of their applications.
  2. Different components have different alerts, severities, and thresholds (for example, web pods, background consumers, and different kinds of jobs). Since the components live in the same namespace, filtering by namespace is not feasible.
  3. You now have to join with kube_X_labels to filter for the specific resources, and complex queries become even more complex, especially ones that already contained joins.

Use Case 2

  1. The platform team defines general, default rules for every namespace.
  2. The platform team expects resources to carry a set of standard labels that define things like alerting.xxx/severity and alerting.xxx/slack-channel, alongside the app.kubernetes.io/name and app.kubernetes.io/component ones (an example manifest follows this list).
  3. Complex rules are defined that join with these labels and generate alerts with variables based on the labels teams define. Queries become even more complex as you join on more than one label.
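
For illustration, a resource carrying those standard labels might look like this (the label keys are from the use case above; the values are made up):

    metadata:
      labels:
        app.kubernetes.io/name: emojiapp
        app.kubernetes.io/component: web
        alerting.xxx/severity: critical
        alerting.xxx/slack-channel: team-emoji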

Complex Queries Example

Deployment has not matched the expected number of replicas.

  • Using one label for the join

    (
     kube_deployment_spec_replicas{} * on (deployment) group_left(label_app_kubernetes_io_name) kube_deployment_labels{ label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}
     !=
     kube_deployment_status_replicas_available{} * on (deployment) group_left(label_app_kubernetes_io_name) kube_deployment_labels{ label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}
    ) 
    and 
    (
     changes(kube_deployment_status_replicas_updated{}[10m] ) * on (deployment) group_left(label_app_kubernetes_io_name) kube_deployment_labels{ label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}
     ==  0
    )
    
  • Using two labels for the join

    (
     kube_deployment_spec_replicas{} * on (deployment) group_left(label_app_kubernetes_io_name, label_alerting_severity) kube_deployment_labels{ label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}
     !=
     kube_deployment_status_replicas_available{} * on (deployment) group_left(label_app_kubernetes_io_name, label_alerting_severity) kube_deployment_labels{ label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}
    ) 
    and 
    (
     changes(kube_deployment_status_replicas_updated{}[10m] ) * on (deployment) group_left(label_app_kubernetes_io_name, label_alerting_severity) kube_deployment_labels{ label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}
     ==  0
    )
    

The same queries, but with the labels as part of each metric series:

  • (
      kube_deployment_spec_replicas{label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"} 
      !=
      kube_deployment_status_replicas_available{label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}
    ) 
    and 
    (
      changes(kube_deployment_status_replicas_updated{label_app_kubernetes_io_name="emojiapp", namespace=~"emoji"}[10m] )
      ==  0
    )
    
  • (
      kube_deployment_spec_replicas{label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"} 
      !=
      kube_deployment_status_replicas_available{label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}
    ) 
    and 
    (
      changes(kube_deployment_status_replicas_updated{label_app_kubernetes_io_name="emojiapp", label_alerting_severity="critical", namespace=~"emoji"}[10m] )
      ==  0
    )
    

Goal

  • Make it easier for platform and development teams to create queries, rules, alerts, and dashboards for resources running in Kubernetes without needing complex joins to filter resources.
  • Alleviate some pressure on the Prometheus servers that are constantly running joins for each alert rule.

Alternatives we tried

  • Run recording rules to pre-compute the joins: having a recording rule for each metric series generated by KSM is cumbersome, and it adds a dependency on the platform team for adding any recording rule, which is not ideal, especially in a multi-tenant environment (a sketch of this alternative follows below).
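
A minimal sketch of that alternative, pre-computing one of the joins from the example above as a Prometheus recording rule (the rule name is made up; the metrics are the ones used in this issue, joining on namespace as well so series stay unique across namespaces):

    groups:
      - name: ksm-label-joins
        rules:
          # Pre-compute the join once so alert expressions can select the
          # recorded series instead of repeating the join in every rule.
          - record: deployment:kube_deployment_spec_replicas:labeled
            expr: |
              kube_deployment_spec_replicas
                * on (namespace, deployment) group_left(label_app_kubernetes_io_name)
              kube_deployment_labels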

How does this change affect the cardinality of KSM?

(increases, decreases, or does not change cardinality)

  • Unless the new flag is explicitly used, this PR makes no change to the cardinality of the generated metrics.
  • Using the flag will add new labels to each metric series of each resource that has an allowlisted label key.
  • The cardinality of the label values depends on how often these labels change for the same resource.
  • For the use case behind this new feature, allowlisted labels are typically labels that rarely change. Admins should be cautious about which labels they allowlist.

Performance

  • KSM already fetches each resource's labels during metric collection.
  • Prometheus performance shouldn't be affected as long as the allowlisted labels do not change constantly.

Misc

Which issue(s) this PR fixes:

  • Fixes #1415

Notes:

  • ⚠️ This is a draft PR; the implementation is not final. The PR is a working POC for the pods resource and is yet to be discussed.

sherifabdlnaby · Jun 10 '22

This seems useful

xmcqueen · Jul 14 '22

I would be okay with having something like this in KSM since it is hard to achieve the same result with scrape-time relabeling. We can have a flag like --series-labels-allowlist or --resource-labels-allowlist that will be empty by default to make sure the feature is backwards compatible. Maybe @mrueg or @dgrisonnet have some counterpoints though.

fpetkovski · Jul 17 '22

I would personally stand against that feature. The memory impact of that feature on kube-state-metrics and Prometheus would be enormous compared to the actual benefits it brings. Also, we have to keep in mind that kube-state-metrics is meant to expose metrics with a 1:1 mapping to the Kubernetes API. I wouldn't expect the Pod spec metrics to contain any information about the pod's labels. Anything beyond that should be considered out of scope in my opinion.

The core problem that you are trying to solve is to make alerting simpler. This is a good idea, but I don't think injecting more data inside of time series is the right way to go here. Time series are a very expensive kind of data: the more information you put inside of them, the more resources are used to store and retrieve the data. Alerts, on the other hand, are cheap as long as the time series themselves have low cardinality.

Recording rules would be my go-to way to solve the kind of problem you are having. Writing them is for sure tedious when you have a lot of components, but this can be worked around by using an intermediate language to generate the recording rules for you and your teams. Some very common ones in the ecosystem are Jsonnet and Cue. For example, kubernetes-mixin is a project where that kind of technique is used to generate complex recording/alerting rules.

dgrisonnet · Jul 22 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Oct 20 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Nov 19 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Dec 19 '22

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Dec 19 '22

Hello folks @fpetkovski, @dgrisonnet, @xmcqueen, I would like to upvote this feature request and ask whether it could be implemented for pods in the nearest future; I may also dive into the suggested code.

The case I'm working on involves an otel-collector scraping Prometheus metrics from KSM and exporting them to a commercial backend. The thing is that the commercial backend charges by metric cardinality, so reporting metrics with labels such as pod_name and then doing joins would cost a lot.

To avoid these excessive costs and shrink cardinality by aggregating before exporting metrics (with an acceptable loss of per-pod visibility), it looks very promising to have this labelling in KSM and then remove the extra labels (like pod_name, uid, etc.) in an otel-collector processor.

/lifecycle reopen

kostz · Feb 07 '23

/reopen

kostz · Feb 07 '23

@kostz: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Feb 07 '23

My concerns from https://github.com/kubernetes/kube-state-metrics/issues/1758#issuecomment-1192710597 still apply.

Today you could remove labels via the otel-collector relabeling config (a sketch follows below), but more complex aggregation, similar to what Prometheus offers with recording rules, is not possible: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/README.md#unsupported-features. Maybe there is a way to do post-processing in Otel, but that is out of kube-state-metrics' scope.

dgrisonnet · Feb 07 '23
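
A minimal sketch of the relabeling approach mentioned above, using the otel-collector prometheus receiver (which accepts standard Prometheus scrape configs; the job name and target address are assumptions):

    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: kube-state-metrics
              static_configs:
                - targets: ["kube-state-metrics:8080"]  # assumed service address
              metric_relabel_configs:
                # Drop a high-cardinality label before export. Note that
                # dropping a label that distinguishes series (e.g. pod) would
                # collide series at scrape time; that case needs aggregation
                # in a processor instead, which is the limitation linked above.
                - action: labeldrop
                  regex: uid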