Human readable logging on timeout
Describe the feature
When queue proxy timeouts it shows the following trace:
{
"severity": "ERROR",
"timestamp": "2025-02-04T18:01:55.621921326Z",
"logger": "queueproxy",
"caller": "network/error_handler.go:33",
"message": "error reverse proxying request; sockstat: sockets: used 9\nTCP: inuse 1 orphan 1 tw 14 alloc 444 mem 2299\nUDP: inuse 0 mem 4352\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n",
"commit": "6a27004",
"knative.dev/key": "vista/helloworld-go-00004",
"knative.dev/pod": "helloworld-go-00004-deployment-5578b648b-kldq5",
"error": "context canceled",
"stacktrace": "knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:472\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3.2\n\tknative.dev/serving/pkg/queue/handler.go:65\nknative.dev/serving/pkg/queue.(*Breaker).Maybe\n\tknative.dev/serving/pkg/queue/breaker.go:155\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:63\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2171\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2171\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"
}
The error that it shows is:
error reverse proxying request; sockstat: sockets: used 9
TCP: inuse 1 orphan 1 tw 14 alloc 444 mem 2299
UDP: inuse 0 mem 4352
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
I guess that for most people, it is hard to correlate this error with a timeout due to timeoutSeconds. My proposal is to log or trace the timeout when it occurs in the following functions:
In each of the function indicating that a timeout was triggered and which timeout was triggered. I think that it is a simple fix, that should not impact in performance as it should be quite infrequent and it can help people quite a log figuring out what is happening in their systems.
/triage accepted
yeah this has caused confusion in the past
Timeout correlation is one thing the other is that the output there is only helpful for getting the socket stats and the connection error just in case you hit some known issue e.g. low number of file descriptors, someone killed the connection (context cancelled) etc. Most of the time you need to check elsewhere than the qp tbh e.g. does your app take too long to process a request for example queued somewhere etc.
@dprotaso is this still required? If so, I would like to help with it
@thiagomedina sure
/assign @thiagomedina
@thiagomedina related to this is what happens when the request is cancelled
https://github.com/knative/serving/issues/13746
@thiagomedina related to this is what happens when the request is cancelled
thanks, I'll investigate it