ingress-nginx
Document prometheus metrics
HELP nginx_ingress_controller_bytes_sent The number of bytes sent to a client TYPE nginx_ingress_controller_bytes_sent histogram
Labels:
- controller_class
- controller_namespace
- controller_pod
- host
- ingress
- method
- namespace
- path
- service
- status
HELP nginx_ingress_controller_config_hash Running configuration hash actually running TYPE nginx_ingress_controller_config_hash gauge
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_config_last_reload_successful Whether the last configuration reload attempt was successful TYPE nginx_ingress_controller_config_last_reload_successful gauge
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_config_last_reload_successful_timestamp_seconds Timestamp of the last successful configuration reload. TYPE nginx_ingress_controller_config_last_reload_successful_timestamp_seconds gauge
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_ingress_upstream_latency_seconds Upstream service latency per Ingress TYPE nginx_ingress_controller_ingress_upstream_latency_seconds summary
Labels:
- controller_class
- controller_namespace
- controller_pod
- ingress
- namespace
- service
HELP nginx_ingress_controller_nginx_process_connections current number of client connections with state {reading, writing, waiting} TYPE nginx_ingress_controller_nginx_process_connections gauge
Labels:
- controller_class
- controller_namespace
- controller_pod
- state (reading, waiting, writing)
HELP nginx_ingress_controller_nginx_process_connections_total total number of connections with state {active, accepted, handled} TYPE nginx_ingress_controller_nginx_process_connections_total counter
Labels:
- controller_class
- controller_namespace
- controller_pod
- state (accepted, active, handled)
HELP nginx_ingress_controller_nginx_process_cpu_seconds_total Cpu usage in seconds TYPE nginx_ingress_controller_nginx_process_cpu_seconds_total counter
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_nginx_process_num_procs number of processes TYPE nginx_ingress_controller_nginx_process_num_procs gauge
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_nginx_process_oldest_start_time_seconds start time in seconds since 1970/01/01 TYPE nginx_ingress_controller_nginx_process_oldest_start_time_seconds gauge
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_nginx_process_read_bytes_total number of bytes read TYPE nginx_ingress_controller_nginx_process_read_bytes_total counter
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_nginx_process_requests_total total number of client requests TYPE nginx_ingress_controller_nginx_process_requests_total counter
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_nginx_process_resident_memory_bytes number of bytes of memory in use TYPE nginx_ingress_controller_nginx_process_resident_memory_bytes gauge
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_nginx_process_virtual_memory_bytes number of bytes of memory in use TYPE nginx_ingress_controller_nginx_process_virtual_memory_bytes gauge
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_nginx_process_write_bytes_total number of bytes written TYPE nginx_ingress_controller_nginx_process_write_bytes_total counter
Labels:
- controller_class
- controller_namespace
- controller_pod
HELP nginx_ingress_controller_request_duration_seconds The request processing time in milliseconds TYPE nginx_ingress_controller_request_duration_seconds histogram
Labels:
- controller_class
- controller_namespace
- controller_pod
- host
- ingress
- method
- namespace
- path
- service
- status
HELP nginx_ingress_controller_request_size The request length (including request line, header, and request body) TYPE nginx_ingress_controller_request_size histogram
Labels:
- controller_class
- controller_namespace
- controller_pod
- host
- ingress
- method
- namespace
- path
- service
- status
HELP nginx_ingress_controller_requests The total number of client requests. TYPE nginx_ingress_controller_requests counter
Labels:
- controller_class
- controller_namespace
- controller_pod
- ingress
- namespace
- status
Information about the Ingress controller POD:
- controller_class
- controller_namespace
- controller_pod
Information about the Ingress rule:
- ingress (name)
- namespace
- path (ingress path, not the complete URI in NGINX)
- service (service name)
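As a quick, hedged example of how these metrics and labels combine in practice (the 5m window and the $namespace/$ingress placeholders are illustrative, not part of the controller's output), the per-ingress request rate broken down by status could be queried like this:
sum by (status) (
  rate(nginx_ingress_controller_requests{namespace="$namespace", ingress="$ingress"}[5m])
)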
Review missing nginx_upstream_requests_total metric
Looking through the list above, nginx_ingress_controller_requests is actually pretty much what I want, and even better than the old nginx_upstream_requests_total: with this one I truly have the namespace and ingress information. With the old metrics I had the name of the upstream, which was a concatenation of <namespace>-<service>-<port> and tricky to handle if you had namespaces with dashes in their names.
Just looking through this - perhaps it's not available with the move away from VTS? We appear to have latency, duration, size of requests, etc. on a per service/upstream basis, but I'm not seeing anything about the number of requests to a service/upstream. I presume that's what you are planning to look at as part of https://github.com/kubernetes/ingress-nginx/issues/2924#issuecomment-413246750
Edit: ignore me - it looks to be available by ingress name rather than by service:
nginx_ingress_controller_requests{app="ingress-nginx-ext",controller_class="nginx-ext",controller_namespace="ingress-nginx",controller_pod="XX",exported_namespace="XX",ingress="upstream-ingress-name",instance="XX:XX",job="kubernetes-pods",kubernetes_pod_name="nginx-ingress-controller-ext-XX",namespace="ingress-nginx",pod_template_hash="XX",status="200"}
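Side note, hedging on the scrape setup: with Prometheus's default honor_labels: false, a namespace label exposed by the controller is renamed to exported_namespace because the Kubernetes service discovery already attaches a namespace label for the scraped pod, which is why the sample above carries both. On such a setup a per-ingress rate would look something like the sketch below, where my-app-namespace is a placeholder:
sum by (ingress, status) (
  rate(nginx_ingress_controller_requests{exported_namespace="my-app-namespace"}[5m])
)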
What is the difference between these two?
nginx_ingress_controller_response_size_sum
nginx_ingress_controller_bytes_sent_sum
I think they are identical? So the metrics are duplicated for no clear benefit?
And these metrics are histograms with very high cardinality, using buckets that really do not make any sense:
{le="+Inf"} 47207
{le="0.005"} 0
{le="0.05"} 0
{le="0.25"} 0
{le="2.5"} 0
{le="0.01"} 0
{le="0.025"} 0
{le="0.1"} 0
{le="0.5"} 0
{le="1"} 0
{le="10"} 0
{le="5"}
There are no fractional bytes, and essentially every response is larger than 10 bytes, so all data ends up in the +Inf bucket. I think counting bytes in a (non-configurable) histogram makes no sense.
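Even with these unhelpful buckets, the _sum and _count series of the histogram remain usable; a minimal sketch for the average bytes sent per request (window and label selection are illustrative):
sum(rate(nginx_ingress_controller_bytes_sent_sum{ingress="$ingress"}[5m]))
/
sum(rate(nginx_ingress_controller_bytes_sent_count{ingress="$ingress"}[5m]))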
These particular bytes-based metrics will lead to a combinatorial explosion in Prometheus, creating too many time series, since they combine le (12 series), method (2 or more series), path (possibly unbounded?), and status (also possibly dozens).
So I think these should be collected as simple counters, not histograms:
nginx_ingress_controller_bytes_sent_bucket
nginx_ingress_controller_request_size_bucket
nginx_ingress_controller_response_size_bucket
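For what it's worth, the _sum component of each of these histograms can already be consumed like the plain counter being asked for here; a sketch of bytes-per-second throughput per ingress (the 5m window is illustrative):
sum by (ingress) (
  rate(nginx_ingress_controller_bytes_sent_sum[5m])
)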
I have to say the structure of the VTS metrics (after latest updates) was much better.
I'm trying to understand how nginx_ingress_controller_ingress_upstream_latency_seconds_sum can be negative. I would assume the request doesn't time travel ⌛️ An explanation would be appreciated.
Also is there an average available? I only saw quantiles – which is great btw.
@estahn this was fixed in 0.18.0 https://github.com/kubernetes/ingress-nginx/pull/2844
@aledbf Question about the metrics: we use the ingress server-snippet annotation to set a custom proxy_pass to a non-k8s service in certain circumstances, as we are currently in a migration phase (and to a normal k8s service in the default case). Is there currently any way to see metrics for this? I.e. how many requests got proxy_passed to the default k8s service and how many through our custom snippet?
Edit: From what I have found there isn't, but it is not a big deal, as we have now added a Prometheus exporter to our k8s app itself, so we can monitor overall traffic to the ingress as well as the traffic that actually reached the pods.
@aledbf
- I'm trying to figure out how to calculate the average for e.g. response_duration. Would this be correct?
sum(nginx_ingress_controller_response_duration_seconds_sum{ingress="$ingress"}) /
sum(nginx_ingress_controller_response_duration_seconds_count{ingress="$ingress"})
- Regarding nginx_ingress_controller_request_duration_seconds_bucket, I understand that each bucket has the value of the previous bucket plus its own. How is this being used?
@estahn Histograms are only useful when you work with the le label, either by rendering the buckets directly or by reducing them to percentiles. This can be done, for instance, in a heatmap in histogram mode in Grafana, or by transforming the histogram to percentiles using the histogram_quantile function.
Here's an example for the first case:
sum by (le)(
increase(
nginx_ingress_controller_request_duration_seconds_bucket{
controller_class =~ "$controller_class",
namespace =~ "$namespace",
ingress =~ "$ingress"
}[$interval]
)
)
Here's an example for the second case:
histogram_quantile(
0.99,
sum by (le)(
rate(
nginx_ingress_controller_request_duration_seconds_bucket{
controller_class =~ "$controller_class",
namespace =~ "$namespace",
ingress =~ "$ingress"
}[$interval]
)
)
)
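And for the average asked about above, the usual pattern is to divide _sum by _count, both rated over the same window; a sketch keeping the same template variables (shown here for request_duration_seconds, but the same shape applies to the response_duration metric in the question):
sum(rate(nginx_ingress_controller_request_duration_seconds_sum{ingress=~"$ingress"}[$interval]))
/
sum(rate(nginx_ingress_controller_request_duration_seconds_count{ingress=~"$ingress"}[$interval]))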
In addition to nginx_ingress_controller_requests, which captures aggregate metrics at the ingress level, are there any plans to expose metrics on a per-upstream-endpoint basis? That would be useful to support Horizontal Pod Autoscaling with custom metrics, since the ingress controller is ideally positioned to collect those metrics (as opposed to having every service pod expose HTTP metrics).
It looks as though the Lua monitoring may already collect those metrics but they're just not being exposed to Prometheus?
It looks as though the Lua monitoring may already collect those metrics but they're just not being exposed to Prometheus?
Yes
In addition to nginx_ingress_controller_requests, which captures aggregate metrics at the ingress level, are there any plans to expose metrics on a per-upstream-endpoint basis?
The problem with this (0.16.0 contains this feature) is the explosion of metrics because of the label cardinality.
We are exploring how to enable this in a controlled way to avoid this issue.
@luispollo
Sounds good, @aledbf. Is there a separate issue tracking this item? Thanks for the update.
IMHO the metrics should work just like the most recent native Prometheus export of the VTS module: configurable buckets, upstream metrics, etc.
It's just that special care has to be taken that not all metrics carry all label combinations, as that would lead to a DoS of the Prometheus server.
For instance, the upstream/endpoint metrics should probably not carry all dimensions such as request method, request path, etc.
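One way to make the cardinality concern concrete is to count the series a single metric currently produces; a hedged sketch against one of the histograms documented above:
# total number of series for one histogram
count(nginx_ingress_controller_request_duration_seconds_bucket)
# number of distinct values of a single label, e.g. path
count(count by (path) (nginx_ingress_controller_request_duration_seconds_bucket))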
P.S. @aledbf Looking at the changes in #2701 and later, it looks like the focus was on removing labels related to client information (remoteAddr, remoteUser, etc.), whereas my question was about labels identifying the target upstream pods.
In particular, there's an endpoint field from the Lua monitor that looks like it may have the info I'm after, and it is currently commented out in the labels:
https://github.com/kubernetes/ingress-nginx/blob/68357f8e671aaf4f6fad50d9f01fa2fe63e3c8ef/internal/ingress/metric/collectors/socket.go#L83-L98
It seems the cardinality of that label would only grow with the number of your service pods, which I would hope is several orders of magnitude lower than the number of clients. Would you consider adding that label perhaps?
Would you consider adding that label perhaps?
This is one of the labels that cause the high cardinality of metrics.
Understood. Thanks for the quick reply.
I am seeing a max of 10s in my request latency, and I've noticed the largest bucket is 10s. Can we have more bucket values for the latency metric?
@aledbf any thoughts?
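Until the buckets are configurable, requests slower than the largest bucket can at least be counted by subtracting the 10s bucket rate from the +Inf bucket rate; a sketch (window illustrative):
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{le="+Inf"}[5m]))
-
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{le="10"}[5m]))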
I'll echo some of the previous comments urging per-upstream metrics.
When a service has N endpoints and you see latency or request errors in the aggregate, it's really helpful to be able to drill down to a specific upstream pod when troubleshooting.
For a lot of us this dimension might grow by a few hundred series per day, whilst labels like User and RequestIP are in the millions.
The best option might be to make this configurable. There is a big difference in the cardinality of Users, RequestIPs etc. between a corporate environment and a public API, for example, and the former might be very willing to pay the price for having those metrics.
Question about the metrics nginx_ingress_controller_success and nginx_ingress_controller_errors. For example, I have these Prometheus outputs:
nginx_ingress_controller_success{class="nginx",controller_revision_hash="2071021497",instance="10.9.22.25:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-4v7v6",name="nginx-ingress-lb",namespace="kube-system"}
value=15000
and
nginx_ingress_controller_errors{class="service",controller_revision_hash="3065885245",instance="10.2.2.17:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-service-f5z4g",name="nginx-ingress-lb-service",namespace="kube-system"}
value=1
So, what does that mean in practice?
Previous versions contained the metric ingress_controller_success with the label count=reloads, like:
ingress_controller_success{count="reloads",instance="10.3.2.101:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-4mzlf",name="nginx-ingress-lb"}
and it was clear. Now I have no idea what these new metrics mean.
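Assuming these counters still track configuration reload operations, as the old ingress_controller_success{count="reloads"} did, the gauge documented above is the more direct signal for failed reloads; the increase() expression is a hedged alternative if the errors counter does count reload failures:
# last reload failed
nginx_ingress_controller_config_last_reload_successful == 0
# reload errors in the last hour (assumption: errors counts failed reloads)
increase(nginx_ingress_controller_errors[1h]) > 0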
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
I'm still struggling with the 10s max bucket. Any suggestions to resolve this?
Pretty sure nginx_ingress_controller_nginx_process_cpu_seconds_total is reporting wrong values.
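For reference, the usual way to consume a *_cpu_seconds_total counter is to rate it, which yields CPU usage in cores and makes it easy to check whether the reported values are plausible; a sketch (window illustrative):
rate(nginx_ingress_controller_nginx_process_cpu_seconds_total[5m])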
Can you help me understand better how nginx-controller metrics work? I have this situation in my AKS cluster:
- nginx ingress-controller in the 'infra' namespace
- ingress resources (with tls spec) in various namespaces
I don't understand why some metrics have the namespace attribute set to 'infra' (the controller namespace) and not to the ingress namespace, for example nginx_ingress_controller_ssl_expire_time_seconds. The TLS specs are in ingress and secret resources defined in namespaces different from 'infra'. Thanks for your help.
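As for consuming the certificate metric itself, a common hedged pattern is to turn the expiry timestamp into days remaining, regardless of which namespace label it ends up with (any labels beyond the metric name are whatever your setup attaches):
(nginx_ingress_controller_ssl_expire_time_seconds - time()) / 86400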