ingress-nginx
Remove old ingress-rules metrics for prometheus scraping
What happened:
After you update an Ingress rule, the controller keeps serving metrics for the old rule (in addition to the new one), which increases cardinality and produces not-useful data (for the removed rules) on every Prometheus scrape of the pod.
What you expected to happen:
Once the rules are updated or removed, the metrics carrying the old rules' labels should be removed as well, which reduces cardinality and avoids exposing not-useful data for removed/updated rules.
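For illustration only, here is a minimal sketch of the requested behaviour using prometheus/client_golang's DeletePartialMatch to drop every series that carries a removed rule's labels. The collector, metric shape, and label names (namespace, ingress, status) are assumptions for the example, not the controller's actual code; a real fix would need to hook something like this into the controller's ingress update/delete handling.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// A stand-in for a per-ingress request counter. The label names mirror the
// ones visible on nginx_ingress_controller_requests, but this is an
// illustrative collector, not the controller's own.
var requests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "nginx_ingress_controller_requests",
		Help: "Total requests, partitioned by ingress (illustrative).",
	},
	[]string{"namespace", "ingress", "status"},
)

// removeIngressSeries drops every series whose namespace/ingress labels match
// the removed rule, so the next scrape no longer exposes them.
func removeIngressSeries(namespace, ingress string) int {
	return requests.DeletePartialMatch(prometheus.Labels{
		"namespace": namespace,
		"ingress":   ingress,
	})
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(requests)

	// Simulate traffic for two ingress rules.
	requests.WithLabelValues("default", "old-rule", "200").Inc()
	requests.WithLabelValues("default", "new-rule", "200").Inc()

	// The rule "old-rule" is deleted; purge its series so future scrapes
	// only expose the rules that still exist.
	deleted := removeIngressSeries("default", "old-rule")
	fmt.Printf("dropped %d stale series\n", deleted) // dropped 1 stale series
}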
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
Kubernetes version (use kubectl version): Not relevant
Environment:
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): not relevant
- Kernel (e.g. uname -a): not relevant
- Install tools: EKS, AKS and bare metal
  - Please mention how/where was the cluster created like kubeadm/kops/minikube/kind etc.
- Basic cluster related info:
  - kubectl version
  - kubectl get nodes -o wide
- How was the ingress-nginx-controller installed:
  - If helm was used then please show output of helm ls -A | grep -i ingress
  - If helm was used then please show output of helm -n <ingresscontrollernamespace> get values <helmreleasename>
  - If helm was not used, then copy/paste the complete precise command used to install the controller, along with the flags and options used
  - If you have more than one instance of the ingress-nginx-controller installed in the same cluster, please provide details for all the instances
How to reproduce this issue:
Add 100 rules, then update the same rules or reduce them to 10. The Ingress controller will keep providing metrics data for both the old and the new rules (a scripted version of these steps is sketched below).
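A possible way to script the reproduction with client-go is sketched below; it creates 100 Ingress rules and then deletes all but the first 10. The namespace (default), backend Service (echo:80), and host pattern are placeholders, and the traffic generation plus the /metrics scrapes before and after the deletions are left as manual steps.

package main

import (
	"context"
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	pathType := networkingv1.PathTypePrefix
	ctx := context.Background()

	// Create 100 Ingress rules pointing at a placeholder Service "echo:80".
	for i := 0; i < 100; i++ {
		ing := &networkingv1.Ingress{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("repro-%d", i)},
			Spec: networkingv1.IngressSpec{
				Rules: []networkingv1.IngressRule{{
					Host: fmt.Sprintf("repro-%d.example.com", i),
					IngressRuleValue: networkingv1.IngressRuleValue{
						HTTP: &networkingv1.HTTPIngressRuleValue{
							Paths: []networkingv1.HTTPIngressPath{{
								Path:     "/",
								PathType: &pathType,
								Backend: networkingv1.IngressBackend{
									Service: &networkingv1.IngressServiceBackend{
										Name: "echo",
										Port: networkingv1.ServiceBackendPort{Number: 80},
									},
								},
							}},
						},
					},
				}},
			},
		}
		if _, err := cs.NetworkingV1().Ingresses("default").Create(ctx, ing, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}

	// Delete all but the first 10 rules. After sending traffic to the hosts
	// beforehand and scraping /metrics afterwards, the series for the deleted
	// hosts are still exposed by the controller pod.
	for i := 10; i < 100; i++ {
		name := fmt.Sprintf("repro-%d", i)
		if err := cs.NetworkingV1().Ingresses("default").Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
			panic(err)
		}
	}
}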
Increase in cardinality:
cat metrics | grep -v "#" | cut -d "{" -f1 | sort | uniq -c | sort -rn | head -n40
3048 nginx_ingress_controller_request_duration_seconds_bucket
2988 nginx_ingress_controller_response_duration_seconds_bucket
2988 nginx_ingress_controller_connect_duration_seconds_bucket
2820 nginx_ingress_controller_header_duration_seconds_bucket
2794 nginx_ingress_controller_response_size_bucket
2794 nginx_ingress_controller_request_size_bucket
2032 nginx_ingress_controller_bytes_sent_bucket
254 nginx_ingress_controller_response_size_sum
254 nginx_ingress_controller_response_size_count
254 nginx_ingress_controller_requests
254 nginx_ingress_controller_request_size_sum
254 nginx_ingress_controller_request_size_count
254 nginx_ingress_controller_request_duration_seconds_sum
254 nginx_ingress_controller_request_duration_seconds_count
254 nginx_ingress_controller_bytes_sent_sum
254 nginx_ingress_controller_bytes_sent_count
249 nginx_ingress_controller_response_duration_seconds_sum
249 nginx_ingress_controller_response_duration_seconds_count
249 nginx_ingress_controller_connect_duration_seconds_sum
249 nginx_ingress_controller_connect_duration_seconds_count
235 nginx_ingress_controller_header_duration_seconds_sum
235 nginx_ingress_controller_header_duration_seconds_count
After you restart the pod:
cat metrics | grep -v "#" | cut -d "{" -f1 | sort | uniq -c | sort -rn | head -n40
288 nginx_ingress_controller_response_duration_seconds_bucket
288 nginx_ingress_controller_request_duration_seconds_bucket
288 nginx_ingress_controller_header_duration_seconds_bucket
288 nginx_ingress_controller_connect_duration_seconds_bucket
264 nginx_ingress_controller_response_size_bucket
264 nginx_ingress_controller_request_size_bucket
192 nginx_ingress_controller_bytes_sent_bucket
24 nginx_ingress_controller_response_size_sum
24 nginx_ingress_controller_response_size_count
24 nginx_ingress_controller_response_duration_seconds_sum
24 nginx_ingress_controller_response_duration_seconds_count
24 nginx_ingress_controller_requests
24 nginx_ingress_controller_request_size_sum
24 nginx_ingress_controller_request_size_count
24 nginx_ingress_controller_request_duration_seconds_sum
24 nginx_ingress_controller_request_duration_seconds_count
24 nginx_ingress_controller_header_duration_seconds_sum
24 nginx_ingress_controller_header_duration_seconds_count
24 nginx_ingress_controller_connect_duration_seconds_sum
24 nginx_ingress_controller_connect_duration_seconds_count
24 nginx_ingress_controller_bytes_sent_sum
24 nginx_ingress_controller_bytes_sent_count
21 nginx_ingress_controller_ingress_upstream_latency_seconds
19 nginx_ingress_controller_orphan_ingress
7 nginx_ingress_controller_ingress_upstream_latency_seconds_sum
7 nginx_ingress_controller_ingress_upstream_latency_seconds_count
Anything else we need to know:
This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/help
@SilentEntity thanks for reporting this.
- Yes, you are right, and this has been going on for a long time.
- Another typical example is that an expired cert will continue showing up even after the related Ingress is deleted.
- But personally I am waiting for clarity from someone on the aspect of the data being a time series. The context is that the old rule metrics and the metrics from a deleted Ingress's cert are time-series data that a user may still want to view in Grafana (or get from raw Prometheus) in the future (see the sketch after this comment).
So I don't think this is a bug unless we can discuss it and triage it as one; let's wait for expert comments and opinions.
/assign
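To illustrate the exposition-versus-storage distinction raised in the comment above: deleting a label set from a client_golang metric vector only changes what future scrapes expose; samples Prometheus has already ingested stay in its TSDB for the configured retention, so Grafana can still chart the old data. A minimal sketch, using an illustrative gauge rather than the controller's real certificate metric:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// countSeries returns how many series the registry would expose on the next
// scrape for the given metric family name.
func countSeries(reg *prometheus.Registry, name string) int {
	families, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	for _, mf := range families {
		if mf.GetName() == name {
			return len(mf.GetMetric())
		}
	}
	return 0
}

func main() {
	reg := prometheus.NewRegistry()
	// Illustrative gauge shaped like a per-host certificate-expiry metric;
	// the name and label set are assumptions, not the controller's schema.
	expiry := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "demo_ssl_expire_time_seconds",
			Help: "Certificate expiry time per host (illustrative).",
		},
		[]string{"host"},
	)
	reg.MustRegister(expiry)

	expiry.WithLabelValues("old.example.com").Set(1700000000)
	expiry.WithLabelValues("new.example.com").Set(1800000000)
	fmt.Println(countSeries(reg, "demo_ssl_expire_time_seconds")) // 2

	// Dropping the series for the deleted ingress/cert only affects future
	// exposition. Samples already scraped remain queryable in Prometheus for
	// its retention window, so historical Grafana views are not lost.
	expiry.DeleteLabelValues("old.example.com")
	fmt.Println(countSeries(reg, "demo_ssl_expire_time_seconds")) // 1
}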
@longwuyuan: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
/remove-kind bug
This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue ASAP. If you have any questions or want to request prioritization, please reach out in #ingress-nginx-dev on Kubernetes Slack.
In any case, old or expired metrics data won't be present in a new pod (when scaling) or in a restarted pod, which creates discrepancies in the metrics and Grafana dashboards.
+1