Immediate Process Termination on Second Signal in queue-proxy Prevents Graceful Shutdown
In the main function of queue-proxy, signal handling is initialized via signals.NewContext(). I noticed that upon receiving a second signal, the process terminates immediately. This concerns me because it interrupts the graceful shutdown of queue-proxy, so accumulated in-flight requests may not be properly drained.
Could someone please clarify why the design opts for an immediate exit on the second signal? It seems that a more graceful shutdown might be beneficial to ensure that in-flight requests are fully processed.
Thank you for your help
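For reference, the behaviour in question follows roughly the sketch below. This is a hand-written approximation of the common idiom behind signals.NewContext()/SetupSignalHandler, not the actual library source: the first TERM/INT cancels the context so the caller can drain and shut down gracefully, while a second signal exits the process immediately.

// Sketch of the usual "cancel on first signal, exit on second" idiom.
// Names and details below are illustrative, not the library source.
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// newSignalContext returns a context that is cancelled on the first
// SIGTERM/SIGINT. A second signal skips graceful shutdown and terminates
// the process immediately.
func newSignalContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	c := make(chan os.Signal, 2)
	signal.Notify(c, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-c // first signal: start graceful shutdown
		cancel()
		<-c // second signal: give up and exit right away
		os.Exit(1)
	}()
	return ctx
}

func main() {
	ctx := newSignalContext()
	<-ctx.Done() // graceful shutdown (drain, server.Shutdown, ...) would follow here
}

The usual rationale for the hard exit is to let an operator force termination (for example, a second Ctrl+C) when graceful shutdown hangs.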
I don't believe Kubernetes sends two signals to the queue proxy.
First we get a pre-stop hook that triggers the draining of requests. This is where we fail readiness for the Pod and wait for the network changes to propagate.
https://github.com/knative/serving/blob/a221c53d3a7eca8161d52efc86f2e900907e9ad5/pkg/queue/sharedmain/handlers.go#L123
Then we'll get the SIGTERM and trigger a graceful shutdown of the HTTP servers we have in the queue proxy.
I don't believe Kubernetes sends a second SIGTERM; instead, once the termination grace period expires, it sends a SIGKILL, which you cannot catch.
Is there something you're trying to do? Are you seeing different behaviour - if so do you have an example?
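To make that sequence concrete, here is a simplified, hypothetical sketch of the drain-then-shutdown flow (not the actual drainer in knative.dev/pkg/network; all names below are invented for illustration): once draining starts, readiness checks fail so the pod is taken out of rotation, Drain waits for the change to propagate, and only then are the HTTP servers shut down.

// Simplified illustration of the drain-then-shutdown sequence described above.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

type drainer struct {
	mu       sync.Mutex
	draining bool
	sleep    time.Duration // time allowed for readiness/endpoint propagation
}

// ServeHTTP fails readiness probes once draining has started so the pod
// stops receiving new requests.
func (d *drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	d.mu.Lock()
	draining := d.draining
	d.mu.Unlock()
	if draining {
		http.Error(w, "draining", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// Drain flips the readiness state and waits for the network change to
// propagate before the caller shuts the HTTP servers down.
func (d *drainer) Drain() {
	d.mu.Lock()
	d.draining = true
	d.mu.Unlock()
	time.Sleep(d.sleep)
}

func main() {
	d := &drainer{sleep: 15 * time.Second}
	srv := &http.Server{Addr: ":8012", Handler: d}
	go srv.ListenAndServe()

	// First TERM: drain, then shut the server down gracefully.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop
	d.Drain()
	_ = srv.Shutdown(context.Background())
}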
In my test scenario, although it doesn’t happen every time, there's a small chance that the pod exits unexpectedly. After I removed the code that handles receiving the signal a second time in SetupSignalHandler, the issue no longer occurred. I suspect that Kubernetes might send SIGTERM twice in certain situations, perhaps due to retries or something similar?
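One way to confirm or rule out the double-SIGTERM hypothesis is a small probe along the lines of the sketch below (purely illustrative, not part of queue-proxy): run it in the container, or add equivalent logging to SetupSignalHandler, and check how many TERM signals actually arrive before the pod dies.

// Minimal probe that logs every SIGTERM/SIGINT it receives with a timestamp.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	c := make(chan os.Signal, 8)
	signal.Notify(c, syscall.SIGTERM, syscall.SIGINT)

	count := 0
	for sig := range c {
		count++
		log.Printf("received %v (#%d) at %s", sig, count, time.Now().Format(time.RFC3339Nano))
		// Keep running so a hypothetical second signal can be observed;
		// SIGKILL after the grace period will still end the process.
	}
}

If only one TERM ever shows up before the container disappears, the unexpected exits are more likely the SIGKILL at the end of the termination grace period than a second signal.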
Are you able to share your example?
@mzychaco hi, could you add an example to reproduce it?
We made the following changes in pkg/queue/sharedmain/main.go; please note the comments:
// Blocks until we actually receive a TERM signal or one of the servers
// exits unexpectedly. We fold both signals together because we only want
// to act on the first of those to reach here.
select {
case err := <-errCh:
    logger.Errorw("Failed to bring up queue-proxy, shutting down.", zap.Error(err))
    return err
case <-d.Ctx.Done():
    logger.Info("Received TERM signal, attempting to gracefully shutdown servers.")
    // Here we added extra logging plus a sleep for the test. We then ran a
    // performance test (perf) with 40 concurrent requests per second for
    // 300 seconds and observed that a small number of pods were killed
    // shortly after printing the log message "Extra sleep 1 min".
    logger.Info("Extra sleep 1 min")
    time.Sleep(1 * time.Minute)
    logger.Infof("Sleeping %v to allow K8s propagation of non-ready state", drainSleepDuration)
    drainer.Drain()
    for name, srv := range httpServers {
        logger.Info("Shutting down server: ", name)
        if err := srv.Shutdown(context.Background()); err != nil {
            logger.Errorw("Failed to shutdown server", zap.String("server", name), zap.Error(err))
        }
    }
    for name, srv := range tlsServers {
        logger.Info("Shutting down server: ", name)
        if err := srv.Shutdown(context.Background()); err != nil {
            logger.Errorw("Failed to shutdown server", zap.String("server", name), zap.Error(err))
        }
    }
    webterminal.CloseMelody()
    logger.Info("Shutdown complete, exiting...")
}
k8s version: v1.27.8
@mzychaco do you have a repro example of the problem?
We've seen in the past that when the user container doesn't handle SIGTERM properly (usually Python apps), the pod stays around longer and the SIGKILL is sent.
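To illustrate that point, a user container should exit promptly on SIGTERM, roughly like the generic sketch below (not Knative code; the 25s timeout is an assumption chosen to stay under the default 30s terminationGracePeriodSeconds). Otherwise the kubelet waits out the grace period and then sends SIGKILL.

// Generic example of a user container handling SIGTERM: stop accepting new
// connections, finish in-flight requests, then exit before the grace period ends.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	<-stop
	// Allow in-flight requests up to 25s; keep this below the Pod's
	// terminationGracePeriodSeconds so SIGKILL never fires.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}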
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.