serving icon indicating copy to clipboard operation
serving copied to clipboard

Human readable logging on timeout

Open marioloko opened this issue 10 months ago • 2 comments

Describe the feature

When queue proxy timeouts it shows the following trace:

{
  "severity": "ERROR",
  "timestamp": "2025-02-04T18:01:55.621921326Z",
  "logger": "queueproxy",
  "caller": "network/error_handler.go:33",
  "message": "error reverse proxying request; sockstat: sockets: used 9\nTCP: inuse 1 orphan 1 tw 14 alloc 444 mem 2299\nUDP: inuse 0 mem 4352\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n",
  "commit": "6a27004",
  "knative.dev/key": "vista/helloworld-go-00004",
  "knative.dev/pod": "helloworld-go-00004-deployment-5578b648b-kldq5",
  "error": "context canceled",
  "stacktrace": "knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:472\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3.2\n\tknative.dev/serving/pkg/queue/handler.go:65\nknative.dev/serving/pkg/queue.(*Breaker).Maybe\n\tknative.dev/serving/pkg/queue/breaker.go:155\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:63\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2171\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2171\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"
}

The error that it shows is:

error reverse proxying request; sockstat: sockets: used 9
TCP: inuse 1 orphan 1 tw 14 alloc 444 mem 2299
UDP: inuse 0 mem 4352
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

I guess that for most people, it is hard to correlate this error with a timeout due to timeoutSeconds. My proposal is to log or trace the timeout when it occurs in the following functions:

  1. tryTimeoutAndWriteError
  2. tryResponseStartTimeoutAndWriteError
  3. tryIdleTimeoutAndWriteError

In each of the function indicating that a timeout was triggered and which timeout was triggered. I think that it is a simple fix, that should not impact in performance as it should be quite infrequent and it can help people quite a log figuring out what is happening in their systems.

marioloko avatar Feb 04 '25 18:02 marioloko

/triage accepted

yeah this has caused confusion in the past

dprotaso avatar Feb 06 '25 15:02 dprotaso

Timeout correlation is one thing the other is that the output there is only helpful for getting the socket stats and the connection error just in case you hit some known issue e.g. low number of file descriptors, someone killed the connection (context cancelled) etc. Most of the time you need to check elsewhere than the qp tbh e.g. does your app take too long to process a request for example queued somewhere etc.

skonto avatar Feb 07 '25 14:02 skonto

@dprotaso is this still required? If so, I would like to help with it

thiagomedina avatar Sep 07 '25 16:09 thiagomedina

@thiagomedina sure

/assign @thiagomedina

dprotaso avatar Sep 08 '25 13:09 dprotaso

@thiagomedina related to this is what happens when the request is cancelled

https://github.com/knative/serving/issues/13746

dprotaso avatar Sep 08 '25 17:09 dprotaso

@thiagomedina related to this is what happens when the request is cancelled

#13746

thanks, I'll investigate it

thiagomedina avatar Sep 08 '25 22:09 thiagomedina