Immediate Process Termination on Second Signal in queue-proxy Prevents Graceful Shutdown
In the main function of queue-proxy, signal handling is initialized via signals.NewContext(). I noticed that upon receiving a second signal, the process terminates immediately. This concerns me because it interrupts the graceful shutdown of queue-proxy, so accumulated in-flight requests may not be properly drained.
Could someone please clarify why the design opts for an immediate exit on the second signal? It seems that a more graceful shutdown might be beneficial to ensure that in-flight requests are fully processed.
Thank you for your help
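For reference, the behaviour in question follows roughly the sketch below. This is a hand-written approximation of the common idiom behind signals.NewContext()/SetupSignalHandler, not the actual library source: the first TERM/INT cancels the context so the caller can drain and shut down gracefully, while a second signal exits the process immediately.

// Sketch of the usual "cancel on first signal, exit on second" idiom.
// Names and details below are illustrative, not the library source.
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// newSignalContext returns a context that is cancelled on the first
// SIGTERM/SIGINT. A second signal skips graceful shutdown and terminates
// the process immediately.
func newSignalContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	c := make(chan os.Signal, 2)
	signal.Notify(c, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-c // first signal: start graceful shutdown
		cancel()
		<-c // second signal: give up and exit right away
		os.Exit(1)
	}()
	return ctx
}

func main() {
	ctx := newSignalContext()
	<-ctx.Done() // graceful shutdown (drain, server.Shutdown, ...) would follow here
}

The usual rationale for the hard exit is to let an operator force termination (for example, a second Ctrl+C) when graceful shutdown hangs.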
I don't believe Kubernetes sends two signals to the queue proxy.
First we get a pre-stop hook that triggers the draining of requests. This is where we fail readiness for the Pod and wait for the network changes to propagate.
https://github.com/knative/serving/blob/a221c53d3a7eca8161d52efc86f2e900907e9ad5/pkg/queue/sharedmain/handlers.go#L123
Then we'll get the SIGTERM and trigger a graceful shutdown of the HTTP servers we have in the queue proxy.
I don't believe Kubernetes sends a second SIGTERM; instead, once the termination grace period expires, it sends a SIGKILL, which you cannot catch.
Is there something you're trying to do? Are you seeing different behaviour - if so do you have an example?
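To make that sequence concrete, here is a simplified, hypothetical sketch of the drain-then-shutdown flow (not the actual drainer in knative.dev/pkg/network; all names below are invented for illustration): once draining starts, readiness checks fail so the pod is taken out of rotation, Drain waits for the change to propagate, and only then are the HTTP servers shut down.

// Simplified illustration of the drain-then-shutdown sequence described above.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

type drainer struct {
	mu       sync.Mutex
	draining bool
	sleep    time.Duration // time allowed for readiness/endpoint propagation
}

// ServeHTTP fails readiness probes once draining has started so the pod
// stops receiving new requests.
func (d *drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	d.mu.Lock()
	draining := d.draining
	d.mu.Unlock()
	if draining {
		http.Error(w, "draining", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// Drain flips the readiness state and waits for the network change to
// propagate before the caller shuts the HTTP servers down.
func (d *drainer) Drain() {
	d.mu.Lock()
	d.draining = true
	d.mu.Unlock()
	time.Sleep(d.sleep)
}

func main() {
	d := &drainer{sleep: 15 * time.Second}
	srv := &http.Server{Addr: ":8012", Handler: d}
	go srv.ListenAndServe()

	// First TERM: drain, then shut the server down gracefully.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop
	d.Drain()
	_ = srv.Shutdown(context.Background())
}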
In my test scenario, although it doesn’t happen every time, there's a small chance that the pod exits unexpectedly. After I removed the code that handles receiving the signal a second time in SetupSignalHandler, the issue no longer occurred. I suspect that Kubernetes might send SIGTERM twice in certain situations, perhaps due to retries or something similar?
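One way to confirm or rule out the double-SIGTERM hypothesis is a small probe along the lines of the sketch below (purely illustrative, not part of queue-proxy): run it in the container, or add equivalent logging to SetupSignalHandler, and check how many TERM signals actually arrive before the pod dies.

// Minimal probe that logs every SIGTERM/SIGINT it receives with a timestamp.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	c := make(chan os.Signal, 8)
	signal.Notify(c, syscall.SIGTERM, syscall.SIGINT)

	count := 0
	for sig := range c {
		count++
		log.Printf("received %v (#%d) at %s", sig, count, time.Now().Format(time.RFC3339Nano))
		// Keep running so a hypothetical second signal can be observed;
		// SIGKILL after the grace period will still end the process.
	}
}

If only one TERM ever shows up before the container disappears, the unexpected exits are more likely the SIGKILL at the end of the termination grace period than a second signal.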
Are you able to share your example?
@mzychaco hi, could you add an example to reproduce it?
We made the following changes in pkg/queue/sharedmain/main.go; please note the comments:
// Blocks until we actually receive a TERM signal or one of the servers
// exits unexpectedly. We fold both signals together because we only want
// to act on the first of those to reach here.
select {
case err := <-errCh:
    logger.Errorw("Failed to bring up queue-proxy, shutting down.", zap.Error(err))
    return err
case <-d.Ctx.Done():
    logger.Info("Received TERM signal, attempting to gracefully shutdown servers.")
    // Here we added extra logging plus a sleep for the test. We then ran a
    // performance test (perf) with 40 concurrent requests per second for
    // 300 seconds and observed that a small number of pods were killed
    // shortly after printing the log message "Extra sleep 1 min".
    logger.Info("Extra sleep 1 min")
    time.Sleep(1 * time.Minute)
    logger.Infof("Sleeping %v to allow K8s propagation of non-ready state", drainSleepDuration)
    drainer.Drain()
    for name, srv := range httpServers {
        logger.Info("Shutting down server: ", name)
        if err := srv.Shutdown(context.Background()); err != nil {
            logger.Errorw("Failed to shutdown server", zap.String("server", name), zap.Error(err))
        }
    }
    for name, srv := range tlsServers {
        logger.Info("Shutting down server: ", name)
        if err := srv.Shutdown(context.Background()); err != nil {
            logger.Errorw("Failed to shutdown server", zap.String("server", name), zap.Error(err))
        }
    }
    webterminal.CloseMelody()
    logger.Info("Shutdown complete, exiting...")
}
k8s version: v1.27.8
@mzychaco do you have a repro example of the problem?
We've seen in the past that when the user container doesn't handle SIGTERM properly (usually Python apps), the pod stays around longer and the SIGKILL is sent.
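To illustrate that point, a user container should exit promptly on SIGTERM, roughly like the generic sketch below (not Knative code; the 25s timeout is an assumption chosen to stay under the default 30s terminationGracePeriodSeconds). Otherwise the kubelet waits out the grace period and then sends SIGKILL.

// Generic example of a user container handling SIGTERM: stop accepting new
// connections, finish in-flight requests, then exit before the grace period ends.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	<-stop
	// Allow in-flight requests up to 25s; keep this below the Pod's
	// terminationGracePeriodSeconds so SIGKILL never fires.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}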
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.