aws-load-balancer-controller
AWS load balancer controller continues to provide high cardinality unbounded metrics to prometheus endpoint
Describe the bug
After upgrading the AWS Load Balancer Controller via Helm, I keep seeing the rest_client_request_latency_seconds histogram metric exposed on the Prometheus metrics endpoint. It carries a url label containing the request URI for every API version, which amounts to roughly 900 series. I deleted the chart, checked the dependencies, and redeployed, but the problem didn't go away.
https://github.com/kubernetes-sigs/controller-runtime/issues/1423
https://github.com/kubernetes-sigs/controller-runtime/pull/1587
...
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.002"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.004"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.008"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.016"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.032"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.064"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.128"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.256"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.512"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="+Inf"} 1
rest_client_request_latency_seconds_sum{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET"} 0.010152667
rest_client_request_latency_seconds_count{url="https://172.20.0.1:443/api/v1/endpoints?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/namespaces/%7Bnamespace%7D/configmaps/%7Bname%7D",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/namespaces/%7Bnamespace%7D/configmaps/%7Bname%7D",verb="GET",le="0.002"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/api/v1/namespaces/%7Bnamespace%7D/configmaps/%7Bname%7D",verb="GET",le="0.004"} 127
...
Steps to reproduce:
Deploy the aws-load-balancer-controller using the Helm Chart with the ServiceMonitor disabled (serviceMonitor.enabled=false Chart value). Get metrics from the exposed Prometheus endpoint (Chart default, :8080/metrics).
Expected outcome:
The rest_client_request_latency_seconds metric is not present in the exposed metrics at all.
Environment:
- AWS Load Balancer controller: v2.4.5
- Chart version: 1.4.6
- EKS: 1.21.14-eks-fb459a0
Additional context: here is my chart values file; all other values are the chart defaults.
replicaCount: 2
image:
  repository: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller
  tag: v2.4.5
  pullPolicy: IfNotPresent
clusterName: main-eks-qa
fullnameOverride: aws-load-balancer-controller
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::############:role/aws-load-balancer-controller
podLabels:
  ######.####/instance: aws-load-balancer-controller
webhookTLS:
  caCert:
  cert:
  key:
disableIngressClassAnnotation: true
disableIngressGroupNameAnnotation: true
podDisruptionBudget:
  maxUnavailable: 1
serviceMonitor:
  enabled: false
  additionalLabels: {}
  interval: 1m
clusterSecretsPermissions:
  allowAllSecrets: false
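Until the controller picks up the upstream controller-runtime change and stops registering rest_client_request_latency_seconds, one possible workaround is to drop the series at scrape time. Below is a minimal sketch, assuming Prometheus scrapes the controller's :8080/metrics endpoint through a plain scrape job rather than the ServiceMonitor; the job name and target are placeholders, not values from the chart:

# Hypothetical scrape job for the controller; the metric_relabel_configs entry
# is the relevant part. It drops every rest_client_* series after the scrape,
# before ingestion, so the high-cardinality histogram never reaches storage.
scrape_configs:
  - job_name: aws-load-balancer-controller        # placeholder job name
    static_configs:
      - targets: ["<controller-pod-ip>:8080"]     # placeholder target
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: rest_client_.*
        action: drop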
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I've created #3645, which will allow dropping metrics; once it's merged, you'll be able to use the following Helm values to remove the rest client metrics.
serviceMonitor:
  enabled: true
  metricRelabelings:
    - sourceLabels: ["__name__"]
      regex: ^rest_client_.+
      action: drop
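One usage note: metricRelabelings only takes effect when the ServiceMonitor is enabled, since the drop rule is applied by Prometheus (as configured by the Prometheus Operator) at scrape time; if the ServiceMonitor stays disabled, the equivalent rule has to live directly in the Prometheus scrape configuration, as in the sketch above.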
Delivered in v2.8.0