
AWS Load Balancer Controller continues to expose high-cardinality, unbounded metrics on the Prometheus endpoint


Describe the bug
After upgrading the AWS Load Balancer Controller via Helm, I keep seeing the rest_client_request_latency_seconds histogram metric exposed on the Prometheus metrics endpoint. It carries a url label containing the full request URI, which produces roughly 900 time series. I've deleted the chart, checked the dependencies, and redeployed, but the problem didn't go away.
Related: https://github.com/kubernetes-sigs/controller-runtime/issues/1423 and https://github.com/kubernetes-sigs/controller-runtime/pull/1587

...
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.002"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.004"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.008"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.016"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.032"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.064"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.128"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.256"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.512"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="+Inf"} 1
rest_client_request_latency_seconds_sum{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET"} 0.010152667
rest_client_request_latency_seconds_count{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/namespaces/%7Bnamespace%7D/configmaps/%7Bname%7D",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/namespaces/%7Bnamespace%7D/configmaps/%7Bname%7D",verb="GET",le="0.002"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/namespaces/%7Bnamespace%7D/configmaps/%7Bname%7D",verb="GET",le="0.004"} 127
...

Steps to reproduce:

  • Deploy the aws-load-balancer-controller using the Helm chart with the ServiceMonitor disabled (serviceMonitor.enabled=false chart value); a minimal values sketch follows below.
  • Fetch the metrics from the exposed Prometheus endpoint (chart default, :8080/metrics).
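
For reference, a minimal values file that should reproduce this setup (the cluster name below is a placeholder, not a value from this issue; every other chart value is left at its default):

# minimal values.yaml for reproducing; clusterName is a placeholder
clusterName: my-cluster
serviceMonitor:
  enabled: false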

Expected outcome:

The rest_client_request_latency_seconds metric either not being present in the exposed metrics at all, or at least not carrying an unbounded url label.

Environment:

  • AWS Load Balancer controller: v2.4.5
  • Chart version: 1.4.6
  • EKS: 1.21.14-eks-fb459a0

Additional Context: Here is my chart values file; all other values are left at their defaults.

replicaCount: 2

image:
  repository: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller
  tag: v2.4.5
  pullPolicy: IfNotPresent

clusterName: main-eks-qa

fullnameOverride: aws-load-balancer-controller

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::############:role/aws-load-balancer-controller

podLabels:
  ######.####/instance: aws-load-balancer-controller

webhookTLS:
  caCert:
  cert:
  key:

disableIngressClassAnnotation: true

disableIngressGroupNameAnnotation: true

podDisruptionBudget:
  maxUnavailable: 1

serviceMonitor:
  enabled: false
  additionalLabels: {}
  interval: 1m

clusterSecretsPermissions:
  allowAllSecrets: false

yodaflomaster (Nov 25 '22)

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot (Mar 07 '23)

/remove-lifecycle stale

yodaflomaster (Apr 04 '23)

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot (Jul 03 '23)

/remove-lifecycle stale

yodaflomaster (Jul 03 '23)

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot (Jan 23 '24)

/remove-lifecycle stale

yodaflomaster (Feb 06 '24)

I've created #3645, which will allow metrics to be dropped; once it's merged you'll be able to do the following in the Helm values to remove the rest client metrics.

serviceMonitor:
  enabled: true
  metricRelabelings:
    - sourceLabels: ["__name__"]
      regex: ^rest_client_.+
      action: drop
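
If the endpoint is scraped without the ServiceMonitor, a roughly equivalent rule can be added to the Prometheus scrape configuration itself. This is only a sketch: the job name and target address are assumptions, not values taken from the chart.

scrape_configs:
  - job_name: aws-load-balancer-controller                              # hypothetical job name
    static_configs:
      - targets: ["aws-load-balancer-controller.kube-system.svc:8080"]  # assumed service address and metrics port
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: rest_client_.+
        action: drop                                                    # drop all rest client series before ingestion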

stevehipwell (Apr 11 '24)

Delivered in v2.8.0

shraddhabang (May 20 '24)