Unexpected retries with error message logged in activator
/area networking
Hi @dprotaso , as requested in our slack discussion, I'm creating an issue here. Thank you for your help!
issue
We’re using a KafkaSource + Knative Service to run long-running jobs (up to roughly 20 minutes of runtime per request), and sometimes requests get retried even though the service shows no sign that the original request failed. In fact, when a request is retried (often on a different service pod), the original eventually finishes as well, so our system ends up processing the same request twice. Our services are implemented with Python FastAPI, and the container entrypoint is managed by dumb-init to get proper termination behavior. From what we can observe, the service pods are not being restarted (e.g. due to OOM kills).
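For context, this is roughly how the server is started inside the container; dumb-init runs as PID 1 and forwards TERM to uvicorn, which finishes in-flight requests before exiting (module path and port are illustrative):

```python
# Entrypoint started via `dumb-init -- python serve.py` so that SIGTERM is
# forwarded to this process. Module path and port are illustrative.
import uvicorn

if __name__ == "__main__":
    # On SIGTERM uvicorn stops accepting new connections and waits for in-flight
    # requests to finish, which is why our long-running handlers still complete
    # after the pod starts terminating (see the reproduction below).
    uvicorn.run("app.main:app", host="0.0.0.0", port=8080)
```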
The temporary fix (which is not acceptable in production) is to pin the number of service pods: if no pods are ever terminated by scaling, the issue does not occur and no requests are retried. This is causing us problems in production and we can't figure out what is behind it.
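For reference, this is roughly how we pin the replica count today: we patch min-scale and max-scale on the revision template of the Knative Service via the Kubernetes Python client (the service name is taken from the log below; the replica count is illustrative).

```python
# Sketch of the temporary workaround: pin min-scale == max-scale so the
# autoscaler never terminates pods. Note that patching the template creates
# a new revision. The replica count (3) is illustrative.
from kubernetes import client, config

config.load_kube_config()
client.CustomObjectsApi().patch_namespaced_custom_object(
    group="serving.knative.dev",
    version="v1",
    namespace="default",
    plural="services",
    name="dummy-workflow",
    body={
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "autoscaling.knative.dev/min-scale": "3",
                        "autoscaling.knative.dev/max-scale": "3",
                    }
                }
            }
        }
    },
)
```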
We're using version 1.7.1 of Knative serving and eventing.
Are there any potential causes that we should investigate?
reproducing the issue
We've been able to reproduce the issue fairly reliably when the following conditions are met:
- the pods are all scaled down to zero
- we send a Kafka message, which gets picked up by the KafkaSource and triggers a pod scale-up (via the activator)
- in the service pod that receives the request (see the sketch after this list):
  - we make multiple HTTP calls to another service that take 10s to complete
  - 2s after receiving the request, i.e. before the responses to those HTTP calls arrive, we manually trigger the termination of the pod (via the Kubernetes Python client)
  - the HTTP calls and the request execution do complete thanks to graceful shutdown, but the pod logs clearly show a TERM signal being received by both queue-proxy and our workload before the HTTP calls complete
- this produces the error log below in an activator pod, and immediately after that log is emitted we observe another execution of the same request in a different pod
- interestingly, the retry happens only once, so we get exactly 2 executions of that 1 request (one could imagine it being retried many more times; for instance, we noticed that the KafkaSource generally retries errors 5 times, so perhaps this is an activator retry instead?)
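To make the steps above concrete, here is a stripped-down sketch of the reproduction (this is not our production code; the endpoint path, downstream URL, and env var names are illustrative, and the pod's service account needs permission to delete pods):

```python
# Reproduction sketch: the handler fires several ~10s downstream calls and,
# 2s after the request arrives, deletes its own pod via the Kubernetes API,
# so TERM reaches queue-proxy and the workload before the calls finish.
import asyncio
import os

import httpx
from fastapi import FastAPI
from kubernetes import client, config

app = FastAPI()

# Illustrative downstream endpoint that takes ~10s to respond.
SLOW_UPSTREAM = os.getenv("SLOW_UPSTREAM", "http://other-service.default.svc/slow")


async def delete_own_pod_after(delay: float) -> None:
    """Ask the API server to delete the pod we are running in (triggers TERM)."""
    await asyncio.sleep(delay)
    config.load_incluster_config()
    client.CoreV1Api().delete_namespaced_pod(
        name=os.environ["HOSTNAME"],                      # pod name defaults to the hostname
        namespace=os.getenv("POD_NAMESPACE", "default"),  # set via the downward API
    )


@app.post("/run")
async def run_job():
    # Trigger our own termination 2s in, well before the downstream calls finish.
    # Keep a reference so the task is not garbage-collected.
    _kill = asyncio.create_task(delete_own_pod_after(2.0))

    async with httpx.AsyncClient(timeout=30.0) as http:
        responses = await asyncio.gather(*(http.get(SLOW_UPSTREAM) for _ in range(3)))

    # Thanks to graceful shutdown the calls complete and we return 200, but the
    # activator logs the error below and the request is re-executed elsewhere.
    return {"statuses": [r.status_code for r in responses]}
```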
The log from the activator
{"severity":"ERROR","timestamp":"2023-09-12T04:10:28.963720601Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 21\nTCP: inuse 9 orphan 0 tw 22 alloc 908 mem 80\nUDP: inuse 0 mem 23\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"9402a71-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-54cdf744fb-4hdcf","knative.dev/key":"default/dummy-workflow-00001","error":"EOF","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:140\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:301\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:143\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:96\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:235\nknative.dev/serving/pkg/queue.(*Breaker).Maybe\n\tknative.dev/serving/pkg/queue/breaker.go:155\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:225\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:523\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:89\nknative.dev/serving/pkg/activator/handler.(*ConcurrencyReporter).Handler.func1\n\tknative.dev/serving/pkg/activator/handler/concurrency_reporter.go:253\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngo.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP\n\[email protected]/plugin/ochttp/server.go:92\nknative.dev/serving/pkg/activator/handler.NewTracingHandler.func1\n\tknative.dev/serving/pkg/activator/handler/tracing_handler.go:38\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nknative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP\n\tknative.dev/serving/pkg/http/request_log.go:131\nknative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/metric_handler.go:64\nknative.dev/serving/pkg/activator/handler.(*contextHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/context_handler.go:82\nknative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/probe_handler.go:46\nknative.dev/networking/pkg/http/probe.(*handler).ServeHTTP\n\tknative.dev/[email protected]/pkg/http/probe/handler.go:39\nknative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/healthz_handler.go:44\ngolang.org/x/net/http2/h2c.h2cHandler.ServeHTTP\n\tgolang.org/x/[email protected]/http2/h2c/h2c.go:104\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2916\nnet/http.(*conn).serve\n\tnet/http/server.go:1966"}
Cleaning up the error message and the stacktrace from the log above:
error reverse proxying request; sockstat: sockets: used 21
TCP: inuse 9 orphan 0 tw 22 alloc 908 mem 80
UDP: inuse 0 mem 23
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
knative.dev/pkg/network.ErrorHandler.func1
knative.dev/[email protected]/network/error_handler.go:33
knative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1
knative.dev/serving/pkg/activator/handler/handler.go:140
net/http/httputil.(*ReverseProxy).ServeHTTP
net/http/httputil/reverseproxy.go:301
knative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest
knative.dev/serving/pkg/activator/handler/handler.go:143
knative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1
knative.dev/serving/pkg/activator/handler/handler.go:96
knative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1
knative.dev/serving/pkg/activator/net/throttler.go:235
knative.dev/serving/pkg/queue.(*Breaker).Maybe
knative.dev/serving/pkg/queue/breaker.go:155
knative.dev/serving/pkg/activator/net.(*revisionThrottler).try
knative.dev/serving/pkg/activator/net/throttler.go:225
knative.dev/serving/pkg/activator/net.(*Throttler).Try
knative.dev/serving/pkg/activator/net/throttler.go:523
knative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP
knative.dev/serving/pkg/activator/handler/handler.go:89
knative.dev/serving/pkg/activator/handler.(*ConcurrencyReporter).Handler.func1
knative.dev/serving/pkg/activator/handler/concurrency_reporter.go:253
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2084
go.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP
[email protected]/plugin/ochttp/server.go:92
knative.dev/serving/pkg/activator/handler.NewTracingHandler.func1
knative.dev/serving/pkg/activator/handler/tracing_handler.go:38
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2084
knative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP
knative.dev/serving/pkg/http/request_log.go:131
knative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP
knative.dev/serving/pkg/activator/handler/metric_handler.go:64
knative.dev/serving/pkg/activator/handler.(*contextHandler).ServeHTTP
knative.dev/serving/pkg/activator/handler/context_handler.go:82
knative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP
knative.dev/serving/pkg/activator/handler/probe_handler.go:46
knative.dev/networking/pkg/http/probe.(*handler).ServeHTTP
knative.dev/[email protected]/pkg/http/probe/handler.go:39
knative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP
knative.dev/serving/pkg/activator/handler/healthz_handler.go:44
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP
golang.org/x/[email protected]/http2/h2c/h2c.go:104
net/http.serverHandler.ServeHTTP
net/http/server.go:2916
net/http.(*conn).serve
net/http/server.go:1966