Missing latency by HTTP status code for Knative Serving with Kourier
I'm trying to monitor request latency for my Knative services broken down by HTTP response code (or response code class like 2xx, 4xx, 5xx), similar to how I can monitor RPS.
The Envoy metrics envoy_cluster_external_upstream_rq_time_sum and envoy_cluster_external_upstream_rq_time_count don't include the envoy_response_code or envoy_response_code_class labels, so I cannot break down latency by response status.
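For comparison, a per-status-class RPS breakdown already works from Envoy's request counters. A minimal sketch of the kind of query I mean (label names assume Envoy's default Prometheus tag extraction on the Kourier gateway pods, where the status class ends up in the envoy_response_code_class label):

```promql
# Requests per second by response code class (2xx/4xx/5xx) and Envoy cluster.
# Assumes Envoy's default tag extraction on the Kourier gateway pods.
sum by (envoy_response_code_class, envoy_cluster_name) (
  rate(envoy_cluster_upstream_rq_xx[5m])
)
```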
I've checked:
- envoy_cluster_external_upstream_rq_time_* - no response code labels
- envoy_cluster_upstream_rq_time_* - no response code labels
- envoy_http_downstream_rq_time_* - no cluster-specific or response code labels
- kn_revision_* metrics - these track autoscaler metrics but not request latency
What is the recommended way to monitor request latency per response code (or response code class) for Knative services?
Environment
- Knative Serving version: [your version]
- Ingress: Kourier (3scale-kourier-gateway)
- Monitoring: Prometheus + Grafana
cc @dsimansk do you know what metrics kourier emits?
Hi,
I think this is rather an Envoy issue, as Envoy sits between the client and the server. My understanding is that the net-kourier-controller is unaware of the details of the requests and can't know anything about them.
I also just checked my Envoy metrics (using 1.36) as well as the metrics of the net-kourier-controller, and couldn't find anything that would help, same as you describe in the issue description.
Not sure if there is anything we can do here.
So how do people usually monitor their Knative request latencies? What I ended up doing is putting Nginx in front of Kourier, but this adds an extra hop that I'd like to avoid.
Envoy (as part of Kourier) exposes its default metrics; latency is provided per service as part of the envoy_cluster_upstream_rq_time_bucket histogram.
You can use that to measure latency, but as Envoy doesn't put status code labels on that histogram, you can't break it down per response code.
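For the per-service latency (without the status code split), a minimal PromQL sketch, assuming the Kourier gateway pods are scraped by Prometheus and that envoy_cluster_name identifies the revision's backing cluster:

```promql
# p95 upstream request latency (ms) per Envoy cluster over the last 5 minutes.
histogram_quantile(
  0.95,
  sum by (le, envoy_cluster_name) (
    rate(envoy_cluster_upstream_rq_time_bucket[5m])
  )
)
```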
There is a similar issue in the Envoy repo, but it was closed 5 years ago; you could give that another bump if it's important to you:
Besides that, Istio's sidecar pods do have a metric, istio_request_duration_milliseconds, that also contains the status code:
istio_request_duration_milliseconds_bucket{reporter="destination",source_workload="unknown",source_canonical_service="unknown",source_canonical_revision="latest",source_workload_namespace="unknown",source_principal="unknown",source_app="unknown",source_version="unknown",source_cluster="unknown",destination_workload="httpbin",destination_workload_namespace="default",destination_principal="unknown",destination_app="httpbin",destination_version="",destination_service="httpbin.default.svc.cluster.local",destination_canonical_service="httpbin",destination_canonical_revision="latest",destination_service_name="httpbin",destination_service_namespace="default",destination_cluster="Kubernetes",request_protocol="http",response_code="404",grpc_response_status="",response_flags="-",connection_security_policy="none",le="0.5"} 0
You could try using Istio with Knative; it's also supported.
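If you go that route, a sketch of a latency-per-response-code query on top of Istio's standard metrics (label names as in the sample above):

```promql
# p95 request latency (ms) per destination service and response code,
# as reported by the destination sidecar.
histogram_quantile(
  0.95,
  sum by (le, destination_service_name, response_code) (
    rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
  )
)
```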
You could also give distributed tracing a look; IIRC you can filter your requests there by response code as well.