
Unexpected retries with error message logged in activator

Open anomam opened this issue 2 years ago • 1 comment

/area networking

Ask your question here:

Hi @dprotaso , as requested in our slack discussion, I'm creating an issue here. Thank you for your help!

issue

We’re using a KafkaSource + Knative Service to run long-running jobs (up to 20 minutes of runtime per request), and sometimes requests get retried even though the service shows no sign that the request failed. In fact, when a request is retried (often processed by a different service pod), the original eventually finishes as well, so our system ends up with requests being processed twice. Our services are implemented with Python FastAPI, and the container entrypoint is managed by dumb-init to get proper termination behavior. From what we can observe, the service pods are not being restarted (e.g. due to OOM errors).
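To make the setup concrete, our services look roughly like the minimal sketch below (the endpoint name and timings are illustrative placeholders, not our actual code):

# Minimal sketch of the kind of workload involved: a FastAPI app whose
# handler does long-running work before responding (illustrative only).
import asyncio

from fastapi import FastAPI

app = FastAPI()

@app.post("/process")  # hypothetical endpoint; the KafkaSource sink points at the Knative Service
async def process(payload: dict):
    # Simulate a long-running job; our real jobs can take up to ~20 minutes.
    await asyncio.sleep(600)
    return {"status": "done"}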

For now, our temporary workaround (which is not acceptable in production) is to pin the number of service pods. If no pods are being terminated, the issue does not occur and no requests are retried. This is causing us problems in production and we can't figure out what is behind it.
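As a rough sketch of what pinning the pod count could look like (assuming it is done with the standard Knative min-scale/max-scale annotations on the revision template and applied with the kubernetes Python client; the Service name, namespace and replica count are placeholders):

# Sketch: pin a Knative Service to a fixed pod count by setting
# min-scale == max-scale on its revision template (placeholder values).
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "autoscaling.knative.dev/min-scale": "3",
                    "autoscaling.knative.dev/max-scale": "3",
                }
            }
        }
    }
}

api.patch_namespaced_custom_object(
    group="serving.knative.dev",
    version="v1",
    namespace="default",
    plural="services",
    name="dummy-workflow",  # placeholder Service name
    body=patch,
)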

We're using version 1.7.1 of Knative serving and eventing.

Are there any potential causes that we should investigate?

reproducing the issue

We've been able to reproduce the issue fairly reliably when the following conditions are met:

  • the pods are all scaled down to zero
  • we send a Kafka message, which gets picked up by the KafkaSource and triggers a pod scale-up (via the activator)
  • in our service pod which receives the request:
    • we make multiple HTTP calls to another service that each take ~10s to complete
    • at the same time, we manually trigger termination of the pod 2s after it receives the request (via the kubernetes Python client; see the sketch after this list), i.e. before the responses to the HTTP calls come back
    • thanks to graceful shutdown, the HTTP calls and the request execution do complete, but the pod logs clearly show a TERM signal being received by both queue-proxy and our workload before the HTTP calls finish
  • this generates the following error log in an activator pod, and we observed that immediately after this log is generated, another execution of the request is triggered in a different pod
  • the interesting thing is that the retry happens only once, so we get exactly 2 executions of that 1 request (one could imagine it would be retried a lot more; for instance, we noticed that the KafkaSource generally retries errors 5 times, so maybe it is an activator retry in this case?)
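For reference, a rough sketch of the termination trigger used in the reproduction (assuming the in-cluster kubernetes Python client; the target service URL and the namespace lookup are simplified placeholders):

# Sketch of the reproduction trigger: ~2s after the request arrives, delete
# the pod that is processing it while the long HTTP calls are still in flight.
import asyncio
import os

import httpx
from kubernetes import client, config

async def delete_own_pod_after(delay_s: float) -> None:
    await asyncio.sleep(delay_s)
    config.load_incluster_config()
    client.CoreV1Api().delete_namespaced_pod(
        name=os.environ["HOSTNAME"],            # the pod name defaults to the hostname
        namespace=os.environ["POD_NAMESPACE"],  # assumed to be injected via the downward API
    )

async def handle_request() -> None:
    # Kick off the self-termination 2s from now, then start the slow HTTP calls.
    asyncio.create_task(delete_own_pod_after(2.0))
    async with httpx.AsyncClient(timeout=30.0) as http:
        # Multiple calls to another service that each take ~10s to answer.
        await asyncio.gather(*(http.get("http://other-service/slow") for _ in range(3)))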

The log from the activator

{"severity":"ERROR","timestamp":"2023-09-12T04:10:28.963720601Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 21\nTCP: inuse 9 orphan 0 tw 22 alloc 908 mem 80\nUDP: inuse 0 mem 23\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"9402a71-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-54cdf744fb-4hdcf","knative.dev/key":"default/dummy-workflow-00001","error":"EOF","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:140\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:301\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:143\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:96\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:235\nknative.dev/serving/pkg/queue.(*Breaker).Maybe\n\tknative.dev/serving/pkg/queue/breaker.go:155\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:225\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:523\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:89\nknative.dev/serving/pkg/activator/handler.(*ConcurrencyReporter).Handler.func1\n\tknative.dev/serving/pkg/activator/handler/concurrency_reporter.go:253\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngo.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP\n\[email protected]/plugin/ochttp/server.go:92\nknative.dev/serving/pkg/activator/handler.NewTracingHandler.func1\n\tknative.dev/serving/pkg/activator/handler/tracing_handler.go:38\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nknative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP\n\tknative.dev/serving/pkg/http/request_log.go:131\nknative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/metric_handler.go:64\nknative.dev/serving/pkg/activator/handler.(*contextHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/context_handler.go:82\nknative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/probe_handler.go:46\nknative.dev/networking/pkg/http/probe.(*handler).ServeHTTP\n\tknative.dev/[email protected]/pkg/http/probe/handler.go:39\nknative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/healthz_handler.go:44\ngolang.org/x/net/http2/h2c.h2cHandler.ServeHTTP\n\tgolang.org/x/[email protected]/http2/h2c/h2c.go:104\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2916\nnet/http.(*conn).serve\n\tnet/http/server.go:1966"}

Cleaning up the error message and the stacktrace from the log above:

error reverse proxying request; sockstat: sockets: used 21  
TCP: inuse 9 orphan 0 tw 22 alloc 908 mem 80  
UDP: inuse 0 mem 23  
UDPLITE: inuse 0  
RAW: inuse 0  
FRAG: inuse 0 memory 0
knative.dev/pkg/network.ErrorHandler.func1  
        knative.dev/[email protected]/network/error_handler.go:33  
knative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1  
        knative.dev/serving/pkg/activator/handler/handler.go:140  
net/http/httputil.(*ReverseProxy).ServeHTTP  
        net/http/httputil/reverseproxy.go:301  
knative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest  
        knative.dev/serving/pkg/activator/handler/handler.go:143  
knative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1  
        knative.dev/serving/pkg/activator/handler/handler.go:96  
knative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1  
        knative.dev/serving/pkg/activator/net/throttler.go:235  
knative.dev/serving/pkg/queue.(*Breaker).Maybe  
        knative.dev/serving/pkg/queue/breaker.go:155  
knative.dev/serving/pkg/activator/net.(*revisionThrottler).try  
        knative.dev/serving/pkg/activator/net/throttler.go:225  
knative.dev/serving/pkg/activator/net.(*Throttler).Try  
        knative.dev/serving/pkg/activator/net/throttler.go:523  
knative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP  
        knative.dev/serving/pkg/activator/handler/handler.go:89  
knative.dev/serving/pkg/activator/handler.(*ConcurrencyReporter).Handler.func1  
        knative.dev/serving/pkg/activator/handler/concurrency_reporter.go:253  
net/http.HandlerFunc.ServeHTTP  
        net/http/server.go:2084  
go.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP  
        [email protected]/plugin/ochttp/server.go:92  
knative.dev/serving/pkg/activator/handler.NewTracingHandler.func1  
        knative.dev/serving/pkg/activator/handler/tracing_handler.go:38  
net/http.HandlerFunc.ServeHTTP  
        net/http/server.go:2084  
knative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP  
        knative.dev/serving/pkg/http/request_log.go:131  
knative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP  
        knative.dev/serving/pkg/activator/handler/metric_handler.go:64  
knative.dev/serving/pkg/activator/handler.(*contextHandler).ServeHTTP  
        knative.dev/serving/pkg/activator/handler/context_handler.go:82  
knative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP  
        knative.dev/serving/pkg/activator/handler/probe_handler.go:46  
knative.dev/networking/pkg/http/probe.(*handler).ServeHTTP  
        knative.dev/[email protected]/pkg/http/probe/handler.go:39  
knative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP  
        knative.dev/serving/pkg/activator/handler/healthz_handler.go:44  
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP  
        golang.org/x/[email protected]/http2/h2c/h2c.go:104  
net/http.serverHandler.ServeHTTP  
        net/http/server.go:2916  
net/http.(*conn).serve  
        net/http/server.go:1966

anomam · Sep 14 '23

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] · Dec 14 '23