
Queue Proxy does not exit after draining and server shutdown

Open • Legion2 opened this issue 2 years ago • 1 comment

What version of Knative?

1.3.2

Expected Behavior

When a pod is deleted, queue-proxy should drain the connections and then exit, and after that the user container should exit.

Actual Behavior

When a pod is deleted, queue-proxy drains the connections, shuts down the server, and prints "Shutdown complete, exiting..." in the logs. However, it does not exit and is killed by the kubelet after deletionGracePeriodSeconds (300s).

Steps to Reproduce the Problem

Setup

  1. Set up Knative with Istio
  2. Create the following service:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: foo
  annotations:
    networking.knative.dev/disableAutoTLS: "true"
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "10"
        autoscaling.knative.dev/targetUtilizationPercentage: "70"
    spec:
      containerConcurrency: 1
      containers:
      - name: api
        image: some-image
        resources:
          requests:
            memory: 400Mi
            cpu: 50m
          limits:
            memory: 1024Mi
        ports:
        - containerPort: 8080
          name: http1
  3. Put a relatively constant load on the service: 0.1 requests per second with a 200 ms response time, spiking to 2 s response times.
  4. The autoscaler should scale the number of replicas up and down.
  5. This leaves many pods stuck in the Terminating state, with the queue-proxy container blocking deletion even after it has drained its connections.
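The load in step 3 can be approximated with a small sequential load generator. This is a minimal sketch, not part of the original report: `run_load` and `_fetch` are hypothetical helpers, the URL is an assumption (Knative's default `{name}.{namespace}.{domain}` host pattern with `example.com`), and the 200 ms / 2 s latencies are expected to come from the service itself (`some-image`), not from this script.

```python
import time
import urllib.request
from typing import Callable, Optional


def _fetch(url: str) -> None:
    # One GET request; the response body is read and discarded.
    with urllib.request.urlopen(url, timeout=30) as resp:
        resp.read()


def run_load(
    url: str,
    rate_rps: float,
    duration_s: float,
    fetch: Optional[Callable[[str], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Send requests to `url` at roughly `rate_rps` for `duration_s` seconds.

    Requests never overlap: after a slow response the loop sends the overdue
    request immediately and then returns to the regular schedule.
    Returns the number of requests sent.
    """
    fetch = fetch or _fetch
    interval = 1.0 / rate_rps  # 0.1 rps -> one request every 10 s
    start = clock()
    next_at = start
    sent = 0
    while next_at - start < duration_s:
        now = clock()
        if now < next_at:
            sleep(next_at - now)
        fetch(url)
        sent += 1
        next_at += interval
    return sent


if __name__ == "__main__":
    # Hypothetical URL for the "foo" service in the "default" namespace.
    run_load("http://foo.default.example.com", rate_rps=0.1, duration_s=600)
```

The clock and sleep functions are injectable so the pacing can be checked without a cluster.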

Legion2 • Apr 18 '22

same

CharZhou • Jul 01 '22

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] • Sep 30 '22

This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

knative-prow-robot • Oct 30 '22

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] • Jan 30 '23

Following up here: after the queue proxy finishes draining, Kubernetes sends a TERM signal to the user-container (your application).

The shutdown will be blocked if your application doesn't perform a graceful shutdown.

As an example, this Python server doesn't perform a graceful shutdown:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-test
spec:
  template:
    spec:
      timeoutSeconds: 150
      containers:
        - image: python:3.9-slim
          command: ["python"]
          args: ["-m", "http.server", "8080"]

In contrast, nginx does perform a graceful shutdown:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-test
spec:
  template:
    spec:
      timeoutSeconds: 150
      containers:
      - image: nginx
        ports:
        - containerPort: 80 
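The kubelet's side of this can be sketched locally: send SIGTERM, wait for the grace period, then send SIGKILL. The sketch below is illustrative only; `stop_like_kubelet`, `start`, and the two child scripts are hypothetical names, and it runs on POSIX only. The child that sets `SIG_IGN` stands in for the likely cause in the python:3.9-slim example above: Python installs no SIGTERM handler by default, and Linux does not apply the default action to a container's PID 1 for signals it has no handler for, so `python -m http.server` running as the entrypoint effectively ignores SIGTERM until SIGKILL arrives.

```python
import subprocess
import sys

# Child that installs a SIGTERM handler and exits cleanly when signalled.
HANDLES_SIGTERM = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

# Child that ignores SIGTERM. SIG_IGN stands in for Python running as PID 1
# in a container without a SIGTERM handler, where the signal has no effect.
IGNORES_SIGTERM = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def start(script: str) -> subprocess.Popen:
    """Start a child Python process and wait until its signal setup is done."""
    proc = subprocess.Popen([sys.executable, "-c", script],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()  # blocks until the child prints 'ready'
    return proc


def stop_like_kubelet(proc: subprocess.Popen, grace_s: float) -> str:
    """SIGTERM, wait up to grace_s seconds, then SIGKILL."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_s)
        return "exited after SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL, which cannot be caught or ignored
        proc.wait()
        return "killed after grace period"
    finally:
        proc.stdout.close()


if __name__ == "__main__":
    print(stop_like_kubelet(start(HANDLES_SIGTERM), grace_s=5))   # exited after SIGTERM
    print(stop_like_kubelet(start(IGNORES_SIGTERM), grace_s=1))   # killed after grace period
```

In the pod from the original report, the second case corresponds to waiting out the full 300s grace period.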

dprotaso • Mar 01 '23

Also, the queue proxy has a 30s drain time; if a request arrives during the drain, the timer is reset.
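The reset rule can be modelled as a countdown that restarts on every request. This is a hypothetical sketch of the behaviour described in the comment, not queue-proxy's actual (Go) implementation; `DrainTimer` and its methods are illustrative names.

```python
import threading
import time
from typing import Callable


class DrainTimer:
    """Hypothetical model: draining ends after `quiet_s` seconds with no requests.

    Created when draining starts; every request seen while draining restarts
    the countdown ("30s drain time, reset when a request arrives").
    """

    def __init__(self, quiet_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._quiet_s = quiet_s
        self._clock = clock
        self._lock = threading.Lock()
        self._last_request = clock()  # countdown starts when draining starts

    def on_request(self) -> None:
        """Call for each incoming request; restarts the countdown."""
        with self._lock:
            self._last_request = self._clock()

    def remaining(self) -> float:
        """Seconds left before draining is considered complete."""
        with self._lock:
            elapsed = self._clock() - self._last_request
        return max(0.0, self._quiet_s - elapsed)

    def wait(self, sleep: Callable[[float], None] = time.sleep) -> None:
        """Block until a full quiet period has passed without requests."""
        while True:
            left = self.remaining()
            if left <= 0:
                return
            sleep(left)
```

Under this rule, a request arriving 20s into the drain pushes completion out to 50s after draining began, so the drain only finishes after a full 30s gap with no requests reaching the pod.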

dprotaso • Mar 01 '23

We're experiencing the same problem with Knative Serving 1.7.1. When we deploy the services shown by @dprotaso, the nginx one shuts down gracefully but the Python one does not.

Is there a "canonical" way to address this issue? Thank you

paoloyx • Mar 21 '23

Is there a "canonical" way to address this issue? Thank you

I'm not familiar enough with Python, but generally user applications should listen for SIGTERM and perform a graceful exit.

I wouldn't feel comfortable having Knative force the user container to exit, since important work may happen during shutdown.
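For the Python case, listening for SIGTERM might look like the minimal sketch below. It assumes the app is the plain `http.server` from the python:3.9-slim example; `GracefulHTTPServer` and `serve_until_sigterm` are hypothetical names. The handler cannot call `shutdown()` directly, because `shutdown()` waits for `serve_forever()` to return and both run on the main thread, so the call is handed to a helper thread.

```python
import signal
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from typing import Optional


class GracefulHTTPServer(ThreadingHTTPServer):
    # ThreadingHTTPServer uses daemon worker threads; making them non-daemon
    # lets server_close() join them, so in-flight requests finish before exit.
    daemon_threads = False


def serve_until_sigterm(port: int = 8080) -> Optional[int]:
    """Serve the current directory (like `python -m http.server`) until SIGTERM.

    Returns the signal number that stopped the server.
    """
    server = GracefulHTTPServer(("", port), SimpleHTTPRequestHandler)
    received: list = []

    def handle_sigterm(signum, frame):
        received.append(signum)
        # shutdown() blocks until serve_forever() returns, and serve_forever()
        # runs on this same (main) thread, so calling it here would deadlock.
        threading.Thread(target=server.shutdown, daemon=True).start()

    signal.signal(signal.SIGTERM, handle_sigterm)
    try:
        server.serve_forever()  # returns once shutdown() has been requested
    finally:
        server.server_close()  # closes the socket and joins worker threads
    return received[0] if received else None
```

As an entrypoint, this would be a small script that calls `serve_until_sigterm(8080)`, started with the same `python` command. Because the process now has a SIGTERM handler, the signal is delivered even when it runs as PID 1. Wrapping the unmodified server in an init process such as dumb-init is an alternative that needs no code changes.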

dprotaso • Mar 21 '23

For anyone dealing with this issue in Python, try using dumb-init (it runs as PID 1 and forwards signals such as SIGTERM to your process).

ashrafguitoni • Sep 26 '23

@danielrubin1989: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow[bot] • Nov 20 '23