
Document Prometheus metrics

Open aledbf opened this issue 6 years ago • 66 comments

HELP nginx_ingress_controller_bytes_sent The number of bytes sent to a client TYPE nginx_ingress_controller_bytes_sent histogram

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • host
  • ingress
  • method
  • namespace
  • path
  • service
  • status

HELP nginx_ingress_controller_config_hash Hash of the currently running configuration TYPE nginx_ingress_controller_config_hash gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_config_last_reload_successful Whether the last configuration reload attempt was successful TYPE nginx_ingress_controller_config_last_reload_successful gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_config_last_reload_successful_timestamp_seconds Timestamp of the last successful configuration reload. TYPE nginx_ingress_controller_config_last_reload_successful_timestamp_seconds gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
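
A possible usage sketch combining the two reload metrics above: alert when the last reload attempt failed, or when the last successful reload is suspiciously old (the 300-second threshold is just an illustrative value):

# fire when the last reload attempt failed
nginx_ingress_controller_config_last_reload_successful == 0

# fire when no successful reload has happened in the last 5 minutes (example threshold)
(time() - nginx_ingress_controller_config_last_reload_successful_timestamp_seconds) > 300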

HELP nginx_ingress_controller_ingress_upstream_latency_seconds Upstream service latency per Ingress TYPE nginx_ingress_controller_ingress_upstream_latency_seconds summary

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • ingress
  • namespace
  • service

HELP nginx_ingress_controller_nginx_process_connections current number of client connections with state {reading, writing, waiting} TYPE nginx_ingress_controller_nginx_process_connections gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • state (reading, waiting, writing)
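
A small usage sketch – current connections broken down by state, summed across controller pods:

# current connections per state
sum by (state) (nginx_ingress_controller_nginx_process_connections)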

HELP nginx_ingress_controller_nginx_process_connections_total total number of connections with state {active, accepted, handled} TYPE nginx_ingress_controller_nginx_process_connections_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • state (accepted, active, handled)

HELP nginx_ingress_controller_nginx_process_cpu_seconds_total CPU usage in seconds TYPE nginx_ingress_controller_nginx_process_cpu_seconds_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
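
A common usage sketch – the per-second rate of this counter approximates the number of CPU cores in use by the nginx processes:

# approximate CPU cores used, per controller pod
sum by (controller_pod) (rate(nginx_ingress_controller_nginx_process_cpu_seconds_total[5m]))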

HELP nginx_ingress_controller_nginx_process_num_procs number of processes TYPE nginx_ingress_controller_nginx_process_num_procs gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_oldest_start_time_seconds start time in seconds since 1970/01/01 TYPE nginx_ingress_controller_nginx_process_oldest_start_time_seconds gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
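
For example, the age of the oldest nginx process (a rough uptime signal) can be derived as:

# seconds since the oldest nginx process started
time() - nginx_ingress_controller_nginx_process_oldest_start_time_seconds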

HELP nginx_ingress_controller_nginx_process_read_bytes_total number of bytes read TYPE nginx_ingress_controller_nginx_process_read_bytes_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_requests_total total number of client requests TYPE nginx_ingress_controller_nginx_process_requests_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_resident_memory_bytes number of bytes of resident memory in use TYPE nginx_ingress_controller_nginx_process_resident_memory_bytes gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_virtual_memory_bytes number of bytes of virtual memory in use TYPE nginx_ingress_controller_nginx_process_virtual_memory_bytes gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_write_bytes_total number of bytes written TYPE nginx_ingress_controller_nginx_process_write_bytes_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_request_duration_seconds The request processing time in seconds TYPE nginx_ingress_controller_request_duration_seconds histogram

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • host
  • ingress
  • method
  • namespace
  • path
  • service
  • status

HELP nginx_ingress_controller_request_size The request length (including request line, header, and request body) TYPE nginx_ingress_controller_request_size histogram

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • host
  • ingress
  • method
  • namespace
  • path
  • service
  • status

HELP nginx_ingress_controller_requests The total number of client requests. TYPE nginx_ingress_controller_requests counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • ingress
  • namespace
  • status
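
A usage sketch for this counter (assuming the labels above reach Prometheus unmodified; scrape configs without honor_labels may rename namespace to exported_namespace, as a later comment in this thread shows):

# requests per second, per ingress
sum by (namespace, ingress) (rate(nginx_ingress_controller_requests[5m]))

# share of 5xx responses, per ingress
sum by (namespace, ingress) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
  /
sum by (namespace, ingress) (rate(nginx_ingress_controller_requests[5m]))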

aledbf avatar Aug 10 '18 13:08 aledbf

Information about the Ingress controller POD:

  • controller_class
  • controller_namespace
  • controller_pod

Information about the Ingress rule:

  • ingress (name)
  • namespace
  • path (ingress path, not the complete URI in NGINX)
  • service (service name)

aledbf avatar Aug 10 '18 13:08 aledbf

Review missing nginx_upstream_requests_total metric

aledbf avatar Aug 15 '18 16:08 aledbf

Looking through the list above, nginx_ingress_controller_requests is actually pretty much what I want, and even better than the old nginx_upstream_requests_total: with this one I truly get the namespace and ingress information. With the old metrics I only had the upstream name, a concatenation of <namespace>-<service>-<port>, which was tricky to handle if you had namespaces with dashes in their names.

andor44 avatar Aug 15 '18 16:08 andor44

Just looking through this – perhaps it's not available with the move away from VTS? We appear to have latency, duration, request size, etc. on a per-service/upstream basis, but I'm not seeing anything about the number of requests to a service/upstream. I presume that's what you are planning to look at as part of https://github.com/kubernetes/ingress-nginx/issues/2924#issuecomment-413246750

Edit: ignore me – it looks to be available by ingress name rather than by service:

nginx_ingress_controller_requests{app="ingress-nginx-ext",controller_class="nginx-ext",controller_namespace="ingress-nginx",controller_pod="XX",exported_namespace="XX",ingress="upstream-ingress-name",instance="XX:XX",job="kubernetes-pods",kubernetes_pod_name="nginx-ingress-controller-ext-XX",namespace="ingress-nginx",pod_template_hash="XX",status="200"}

markfermor avatar Aug 15 '18 16:08 markfermor

What is the difference between these two? nginx_ingress_controller_response_size_sum and nginx_ingress_controller_bytes_sent_sum

I think they are identical? So the metrics are duplicated for no clear benefit?

And these metrics are histograms with very high cardinality, with buckets that make no sense for byte-valued data:

{le="+Inf"}	47207
{le="0.005"}	0
{le="0.05"}	0
{le="0.25"}	0
{le="2.5"}	0
{le="0.01"}	0
{le="0.025"}	0
{le="0.1"}	0
{le="0.5"}	0
{le="1"}	0
{le="10"}	0
{le="5"}

There are no fractional bytes, and every response is far larger than the largest finite bucket (10), so all data is in the +Inf bucket. I think counting bytes in a (non-configurable) histogram makes no sense.

These particular bytes-based metrics will lead to a combinatoric explosion in Prometheus, creating too many time series, since they combine le (12 series), method (2-x series), path (possibly infinite?), and status (also possibly dozens).

So I think these should be collected as simple counters, not histograms:

nginx_ingress_controller_bytes_sent_bucket
nginx_ingress_controller_request_size_bucket
nginx_ingress_controller_response_size_bucket
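
In the meantime, the buckets can simply be ignored; the _sum and _count series that every histogram exposes already behave like plain counters (a minimal sketch):

# bytes sent per second, per ingress
sum by (namespace, ingress) (rate(nginx_ingress_controller_bytes_sent_sum[5m]))

# average bytes sent per request over the window
sum(rate(nginx_ingress_controller_bytes_sent_sum[5m]))
  /
sum(rate(nginx_ingress_controller_bytes_sent_count[5m]))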

I have to say the structure of the VTS metrics (after latest updates) was much better.

towolf avatar Aug 17 '18 13:08 towolf

I'm trying to understand how nginx_ingress_controller_ingress_upstream_latency_seconds_sum can be negative. I would assume the request doesn't time travel ⌛️

[screenshot: negative values of nginx_ingress_controller_ingress_upstream_latency_seconds_sum]

An explanation would be appreciated.

Also, is there an average available? I only saw quantiles – which is great, btw.
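
One way to get an average, since the summary also exposes _sum and _count series (a sketch; adjust the window and label filters as needed):

# average upstream latency per ingress over 5 minutes
sum by (namespace, ingress) (rate(nginx_ingress_controller_ingress_upstream_latency_seconds_sum[5m]))
  /
sum by (namespace, ingress) (rate(nginx_ingress_controller_ingress_upstream_latency_seconds_count[5m]))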

estahn avatar Aug 23 '18 04:08 estahn

@estahn this was fixed in 0.18.0 https://github.com/kubernetes/ingress-nginx/pull/2844

aledbf avatar Aug 23 '18 12:08 aledbf

@aledbf Question about the metrics: we use the server-snippet ingress annotation to proxy_pass to a non-k8s service in certain circumstances, as we are currently in a migration phase (and to a normal k8s service in the default case). Is there currently any way to see metrics for this? I.e. how many requests got proxy_passed to the default k8s service and how many through our custom snippet?

Edit: From what I have found, no – and it is not a big deal, as we just added a Prometheus exporter to our k8s app itself, so we can monitor overall traffic to the ingress as well as the traffic that actually reached the pods.

Globegitter avatar Sep 05 '18 07:09 Globegitter

@aledbf

  1. I'm trying to figure out how to calculate the average for e.g. response_duration. Would this be correct? (A rate-based variant is sketched after this list.)
sum(nginx_ingress_controller_response_duration_seconds_sum{ingress="$ingress"}) /
 sum(nginx_ingress_controller_response_duration_seconds_count{ingress="$ingress"})
  2. In regards to nginx_ingress_controller_request_duration_seconds_bucket, I understand that each bucket has the value of the previous bucket plus its own. How is this being used?
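
For reference, a rate-based variant of that average computes it over a chosen window rather than since process start (a sketch using the same metric and label names):

sum(rate(nginx_ingress_controller_response_duration_seconds_sum{ingress="$ingress"}[5m]))
  /
sum(rate(nginx_ingress_controller_response_duration_seconds_count{ingress="$ingress"}[5m]))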

estahn avatar Sep 06 '18 10:09 estahn

@estahn Histograms are only useful when you work with the le label. This can be done, for instance, in a Heatmap panel in histogram mode in Grafana, or by transforming the histogram into percentiles using the histogram_quantile function.

Here's an example for the first case:

sum by (le)(
  increase(
    nginx_ingress_controller_request_duration_seconds_bucket{
      controller_class =~ "$controller_class",
      namespace =~ "$namespace",
      ingress =~ "$ingress"
    }[$interval]
  )
)

[screenshots: Grafana heatmap of the bucketed request duration]

Here's an example for the second case:

histogram_quantile(
  0.99,
  sum by (le)(
    rate(
      nginx_ingress_controller_request_duration_seconds_bucket{
        controller_class =~ "$controller_class",
        namespace =~ "$namespace",
        ingress =~ "$ingress"
      }[$interval]
    )
  )
)

[screenshots: Grafana graphs of the resulting 0.99 quantile]

towolf avatar Sep 10 '18 11:09 towolf

In addition to nginx_ingress_controller_requests, which captures aggregate metrics at the ingress level, are there any plans to expose metrics on a per-upstream-endpoint basis? That would be useful to support Horizontal Pod Autoscaling with custom metrics, since the ingress controller is ideally positioned to collect those metrics (as opposed to having every service pod expose HTTP metrics).

It looks as though the Lua monitoring may already collect those metrics but they're just not being exposed to Prometheus?

luispollo avatar Sep 24 '18 22:09 luispollo

It looks as though the Lua monitoring may already collect those metrics but they're just not being exposed to Prometheus?

Yes

In addition to nginx_ingress_controller_requests, which captures aggregate metrics at the ingress level, are there any plans to expose metrics on a per-upstream-endpoint basis?

The problem with this (0.16.0 contains this feature) is the explosion of metrics because of the label cardinality.

We are exploring how to enable this in a controlled way to avoid this issue.

@luispollo

aledbf avatar Sep 24 '18 22:09 aledbf

Sounds good, @aledbf. Is there a separate issue tracking this item? Thanks for the update.

luispollo avatar Sep 25 '18 16:09 luispollo

IMHO metrics should work just like the most recent native Prometheus export of the VTS module works, with configurable buckets, with upstream metrics, etc.

It's just that special care has to be taken that not every metric gets every label combination; otherwise this will lead to a DoS of the Prometheus server.

For instance, the upstreams/endpoints should probably not have all dimensions in terms of request method, request path, etc.

towolf avatar Sep 25 '18 18:09 towolf

P.S. @aledbf Looking at the changes in #2701 and later, it looks like the focus was on removing labels related to client information (remoteAddr, remoteUser, etc.), whereas my question was about labels identifying the target upstream pods.

In particular, there's an endpoint field from the Lua monitor that looks like it may have the info I'm after, and that is currently commented out in the labels: https://github.com/kubernetes/ingress-nginx/blob/68357f8e671aaf4f6fad50d9f01fa2fe63e3c8ef/internal/ingress/metric/collectors/socket.go#L83-L98

It seems the cardinality of that label would only increase with the scale of your service pods, which I would hope is several orders of magnitude lower than the number of clients. Would you consider adding that label perhaps?

luispollo avatar Sep 25 '18 21:09 luispollo

Would you consider adding that label perhaps?

This is one of the labels that cause the high cardinality of metrics.

aledbf avatar Sep 25 '18 22:09 aledbf

Understood. Thanks for the quick reply.

luispollo avatar Sep 25 '18 22:09 luispollo

I am seeing a maximum of 10s in my request latency, and I've noticed the largest bucket is 10s. Can we have more bucket values for the latency metric?

[screenshot: request latency maxing out at 10s]

@aledbf any thoughts?
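
Until the buckets are configurable, one way to quantify how much traffic falls above the largest finite bucket is to subtract it from the +Inf bucket (a sketch, assuming the default 10s top bucket):

# requests per second taking longer than 10s
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{le="+Inf"}[5m]))
  -
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{le="10"}[5m]))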

rafaeljesus avatar Sep 27 '18 20:09 rafaeljesus

I'll echo some of the previous comments urging per-upstream metrics.

When a service has N endpoints and you experience latency or request errors in the aggregate, it's really helpful to be able to drill down to a specific upstream pod when troubleshooting.

For a lot of us this dimension might grow by a few hundred per day, whilst the User, RequestIP etc are in the millions.

The best might be to have this configurable. There is a big difference in cardinality of Users, RequestIPs etc from a corporate environment to a public API for example, and the former might be very willing to pay the price for having those metrics.

StianOvrevage avatar Nov 29 '18 09:11 StianOvrevage

A question about the metrics nginx_ingress_controller_success and nginx_ingress_controller_errors. For example, I have these Prometheus outputs:

nginx_ingress_controller_success{class="nginx",controller_revision_hash="2071021497",instance="10.9.22.25:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-4v7v6",name="nginx-ingress-lb",namespace="kube-system"}

value=15000

and

nginx_ingress_controller_errors{class="service",controller_revision_hash="3065885245",instance="10.2.2.17:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-service-f5z4g",name="nginx-ingress-lb-service",namespace="kube-system"} 

value=1

So, what does that mean in practice? Previous versions contained a metric ingress_controller_success with a label count=reloads, like:

ingress_controller_success{count="reloads",instance="10.3.2.101:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-4mzlf",name="nginx-ingress-lb"}

and it was clear. Now I have no idea what these metrics mean.
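
For what it's worth, these appear to be plain counters of controller reload operations (successes and errors) without the old count label; assuming that is the case, reload activity could be graphed as below, and the documented nginx_ingress_controller_config_last_reload_successful gauge above gives a direct healthy/unhealthy signal:

# successful reloads in the last hour (assuming the metric counts reload operations)
increase(nginx_ingress_controller_success[1h])

# failed reloads in the last hour
increase(nginx_ingress_controller_errors[1h])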

k0nstantinv avatar Dec 19 '18 10:12 k0nstantinv

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Mar 19 '19 10:03 fejta-bot

/remove-lifecycle stale

towolf avatar Apr 16 '19 09:04 towolf

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Jul 15 '19 10:07 fejta-bot

I'm still struggling with the 10s max bucket. Any suggestions to resolve this?

agolomoodysaada avatar Aug 14 '19 15:08 agolomoodysaada

Pretty sure nginx_ingress_controller_nginx_process_cpu_seconds_total is reporting wrong values.

towolf avatar Aug 30 '19 20:08 towolf

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Nov 28 '19 21:11 fejta-bot

/remove-lifecycle stale

frittentheke avatar Nov 29 '19 06:11 frittentheke

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Feb 27 '20 06:02 fejta-bot

/remove-lifecycle stale

frittentheke avatar Feb 27 '20 06:02 frittentheke

Can you help me understand better how the nginx ingress controller metrics work? I have this situation in my AKS cluster:

  • nginx ingress controller in the 'infra' namespace
  • ingress resources (with tls spec) in various namespaces

I don't understand why some metrics have the namespace attribute set to 'infra' (the controller namespace) rather than to the ingress namespace, for example nginx_ingress_controller_ssl_expire_time_seconds. The TLS specs and secret resources are defined in various namespaces other than 'infra'. Thanks for your help.
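
Regarding the expiry metric itself, a hedged sketch for alerting on certificates that expire soon (assuming the metric carries a host label identifying the certificate):

# certificates expiring within the next 14 days
(nginx_ingress_controller_ssl_expire_time_seconds - time()) < 14 * 24 * 3600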

marcoboffi avatar Mar 14 '20 14:03 marcoboffi