
Remove old ingress-rules metrics for prometheus scraping

[Open] SilentEntity opened this issue 11 months ago • 7 comments

What happened:

Once you update an Ingress rule, the Ingress controller keeps exposing metrics for the old rules (in addition to the new ones), which increases cardinality and produces useless data for the removed rules every time Prometheus scrapes the pod.

What you expected to happen:

Once rules are updated or removed, the metrics for the old rules should be dropped as well, which reduces cardinality and avoids exposing stale data for removed or updated rules.
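For illustration, a minimal sketch (not the actual ingress-nginx implementation) of what that expected behavior could look like with prometheus/client_golang: when an Ingress or one of its rules is deleted, the exporter could drop every label combination that references it via DeletePartialMatch, so the stale series stop appearing on the next scrape. The metric name mirrors the output further below; the reduced label set and the onIngressDeleted hook are hypothetical.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

// requests mirrors nginx_ingress_controller_requests with a subset of its labels.
var requests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "nginx_ingress_controller_requests",
		Help: "Total number of client requests.",
	},
	[]string{"namespace", "ingress", "status"},
)

// onIngressDeleted is a hypothetical hook invoked when an Ingress (or a rule in it)
// is removed. DeletePartialMatch drops every series whose labels contain the given
// subset, so stale series vanish from the next scrape instead of lingering until
// the pod restarts.
func onIngressDeleted(namespace, ingress string) int {
	return requests.DeletePartialMatch(prometheus.Labels{
		"namespace": namespace,
		"ingress":   ingress,
	})
}

func main() {
	// Simulate traffic for two Ingresses.
	requests.WithLabelValues("default", "app-a", "200").Inc()
	requests.WithLabelValues("default", "app-a", "404").Inc()
	requests.WithLabelValues("default", "app-b", "200").Inc()

	fmt.Println("series before:", testutil.CollectAndCount(requests)) // 3
	// Ingress "app-a" is deleted; its two series are pruned immediately.
	fmt.Println("series removed:", onIngressDeleted("default", "app-a")) // 2
	fmt.Println("series after:", testutil.CollectAndCount(requests))     // 1
}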

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

Kubernetes version (use kubectl version): Not relevant

Environment:

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release): not relevant

  • Kernel (e.g. uname -a): not relevant

  • Install tools: EKS, AKS and bare metal

    • Please mention how/where was the cluster created like kubeadm/kops/minikube/kind etc.
  • Basic cluster related info:

    • kubectl version
    • kubectl get nodes -o wide
  • How was the ingress-nginx-controller installed:

    • If helm was used then please show output of helm ls -A | grep -i ingress
    • If helm was used then please show output of helm -n <ingresscontrollernamespace> get values <helmreleasename>
    • If helm was not used, then copy/paste the complete precise command used to install the controller, along with the flags and options used
    • if you have more than one instance of the ingress-nginx-controller installed in the same cluster, please provide details for all the instances

How to reproduce this issue:

Add 100 rules, then update them or reduce them to 10. The Ingress controller will keep exposing metrics for both the old and the new rules.

Increase in cardinality:

cat metrics | grep -v "#" |cut -d "{" -f1  | sort | uniq -c | sort -rn | head -n40
3048 nginx_ingress_controller_request_duration_seconds_bucket
2988 nginx_ingress_controller_response_duration_seconds_bucket
2988 nginx_ingress_controller_connect_duration_seconds_bucket
2820 nginx_ingress_controller_header_duration_seconds_bucket
2794 nginx_ingress_controller_response_size_bucket
2794 nginx_ingress_controller_request_size_bucket
2032 nginx_ingress_controller_bytes_sent_bucket
 254 nginx_ingress_controller_response_size_sum
 254 nginx_ingress_controller_response_size_count
 254 nginx_ingress_controller_requests
 254 nginx_ingress_controller_request_size_sum
 254 nginx_ingress_controller_request_size_count
 254 nginx_ingress_controller_request_duration_seconds_sum
 254 nginx_ingress_controller_request_duration_seconds_count
 254 nginx_ingress_controller_bytes_sent_sum
 254 nginx_ingress_controller_bytes_sent_count
 249 nginx_ingress_controller_response_duration_seconds_sum
 249 nginx_ingress_controller_response_duration_seconds_count
 249 nginx_ingress_controller_connect_duration_seconds_sum
 249 nginx_ingress_controller_connect_duration_seconds_count
 235 nginx_ingress_controller_header_duration_seconds_sum
 235 nginx_ingress_controller_header_duration_seconds_count

After you restart the pod:

cat metrics | grep -v "#" |cut -d "{" -f1  | sort | uniq -c | sort -rn | head -n40
 288 nginx_ingress_controller_response_duration_seconds_bucket
 288 nginx_ingress_controller_request_duration_seconds_bucket
 288 nginx_ingress_controller_header_duration_seconds_bucket
 288 nginx_ingress_controller_connect_duration_seconds_bucket
 264 nginx_ingress_controller_response_size_bucket
 264 nginx_ingress_controller_request_size_bucket
 192 nginx_ingress_controller_bytes_sent_bucket
  24 nginx_ingress_controller_response_size_sum
  24 nginx_ingress_controller_response_size_count
  24 nginx_ingress_controller_response_duration_seconds_sum
  24 nginx_ingress_controller_response_duration_seconds_count
  24 nginx_ingress_controller_requests
  24 nginx_ingress_controller_request_size_sum
  24 nginx_ingress_controller_request_size_count
  24 nginx_ingress_controller_request_duration_seconds_sum
  24 nginx_ingress_controller_request_duration_seconds_count
  24 nginx_ingress_controller_header_duration_seconds_sum
  24 nginx_ingress_controller_header_duration_seconds_count
  24 nginx_ingress_controller_connect_duration_seconds_sum
  24 nginx_ingress_controller_connect_duration_seconds_count
  24 nginx_ingress_controller_bytes_sent_sum
  24 nginx_ingress_controller_bytes_sent_count
  21 nginx_ingress_controller_ingress_upstream_latency_seconds
  19 nginx_ingress_controller_orphan_ingress
   7 nginx_ingress_controller_ingress_upstream_latency_seconds_sum
   7 nginx_ingress_controller_ingress_upstream_latency_seconds_count

Anything else we need to know:

SilentEntity (Mar 01 '24 07:03)

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot (Mar 01 '24 07:03)

/help

@SilentEntity thanks for reporting this.

  • Yes, you are right, and this has been going on for a long time
  • Another typical example is that an expired cert will continue showing up, even after the related Ingress is deleted
  • But personally I am waiting for clarity from someone on the timeseries aspect of the data. The context: the metrics for old rules and the metrics for a deleted Ingress's cert are timeseries that a user may still want to view in Grafana (or query from raw Prometheus) in the future (see the sketch below)

So I don't think this is a bug unless we can discuss and triage it as one. Let's wait for expert comments and opinions.
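On that timeseries point: samples that Prometheus has already scraped are stored in its own TSDB for the configured retention period, independent of whether the controller keeps exposing the series. Pruning stale series on the controller side therefore only stops new samples from being produced; history already collected stays queryable in Grafana or via the HTTP API. A minimal sketch, assuming a Prometheus server reachable at http://prometheus.example:9090 and a hypothetical deleted Ingress named old-app (neither is part of the report above):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed Prometheus address; adjust for your environment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Request rate over the last 24h for an Ingress that no longer exists.
	// The range query still returns the samples scraped while it was alive.
	result, warnings, err := promAPI.QueryRange(ctx,
		`sum(rate(nginx_ingress_controller_requests{ingress="old-app"}[5m]))`,
		v1.Range{
			Start: time.Now().Add(-24 * time.Hour),
			End:   time.Now(),
			Step:  5 * time.Minute,
		})
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result) // historical points remain until Prometheus' retention expires
}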

/assign

longwuyuan (Mar 01 '24 14:03)

@longwuyuan: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this (the /help comment above):

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot (Mar 01 '24 14:03)

/remove-kind bug

longwuyuan (Mar 01 '24 14:03)

This is stale, but we won't close it automatically; just bear in mind the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any question or request to prioritize this, please reach out on #ingress-nginx-dev on Kubernetes Slack.

github-actions[bot] (Apr 01 '24 01:04)

Old or expired metric series won't be present in a new pod (added while scaling) or in a restarted pod anyway, which already creates discrepancies in the metrics and in Grafana dashboards.

SilentEntity (Apr 03 '24 05:04)

+1

jakuboskera (Apr 15 '24 13:04)