
During rollout restart of a deployment the new pod gets terminated within seconds

Open gerritvd opened this issue 1 year ago • 14 comments

What version of Knative?

1.8.1

Expected Behavior

I want to execute a rolling restart on a deployment such that it starts the new pod(s), waits until they are running, and then terminates the old pod(s).
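
In other words, I expect the standard rolling-update sequence you get from a plain Deployment, roughly like this (using the deployment name Knative generates for the revision, as shown later in this report):

$ kubectl rollout restart deploy -n <namespace> hello-00001-deployment
$ kubectl rollout status deploy -n <namespace> hello-00001-deployment
# expected: rollout status waits until the replacement pod is Ready and the
# old ReplicaSet has been scaled down, then reports "successfully rolled out"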

Actual Behavior

I noticed that when trying to do a rolling restart on a KServe or KNative deployment the new pod is immediately terminated and the old pod remains running.

Steps to Reproduce the Problem

We are running KServe 0.10.0 on top of KNative 1.8.1 with istio 1.13.3, on kubernetes 1.27

To narrow down where the issue might be I did the following:

1. Run a basic KServe example: sklearn-iris

This shows the behavior I described above. I won't go into too much detail here, because the behavior shows up with a basic knative deployment as well, as described below.
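
(For completeness, the manifest is essentially the sklearn-iris quickstart from the KServe docs; the snippet below is reproduced from memory, so treat the exact storageUri as approximate.)

$ kubectl apply -n <namespace> -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
EOF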

2. Run a basic KNative serving example: helloworld-go

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          ports:
            - containerPort: 8080
          env:
            - name: TARGET
              value: "World"

This starts a pod with 2 containers:

  1. queue-proxy
  2. user-container

Then I do a rolling restart:

$ kubectl rollout restart deploy -n <namespace> hello-00001-deployment

and I witness the same behavior: new pod gets terminated within seconds. Looking at the event timeline we see:

$ kubectl get events -n <namespace> --sort-by='.lastTimestamp'
10m         Normal    Created                        configuration/hello                                                   Created Revision "hello-00001"
10m         Normal    ScalingReplicaSet              deployment/hello-00001-deployment                                     Scaled up replica set hello-00001-deployment-64ddfc4766 to 1
10m         Normal    Created                        service/hello                                                         Created Route "hello"
10m         Normal    Created                        service/hello                                                         Created Configuration "hello"
10m         Normal    FinalizerUpdate                route/hello                                                           Updated "hello" finalizers
10m         Warning   FinalizerUpdateFailed          route/hello                                                           Failed to update finalizers for "hello": Operation cannot be fulfilled on routes.serving.knative.dev "hello": the object has been modified; please apply your changes to the latest version and try again
10m         Warning   InternalError                  revision/hello-00001                                                  failed to update deployment "hello-00001-deployment": Operation cannot be fulfilled on deployments.apps "hello-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
10m         Normal    SuccessfulCreate               replicaset/hello-00001-deployment-64ddfc4766                          Created pod: hello-00001-deployment-64ddfc4766-zzrcb
10m         Normal    Pulled                         pod/hello-00001-deployment-64ddfc4766-zzrcb                           Successfully pulled image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest" in 343.203487ms (343.211177ms including waiting)
10m         Normal    Created                        pod/hello-00001-deployment-64ddfc4766-zzrcb                           Created container queue-proxy
10m         Normal    Pulling                        pod/hello-00001-deployment-64ddfc4766-zzrcb                           Pulling image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest"
10m         Normal    Started                        pod/hello-00001-deployment-64ddfc4766-zzrcb                           Started container user-container
10m         Normal    Created                        pod/hello-00001-deployment-64ddfc4766-zzrcb                           Created container user-container
10m         Normal    Pulled                         pod/hello-00001-deployment-64ddfc4766-zzrcb                           Container image "ghcr.io/knative/helloworld-go@sha256:d2af882a7ae7d49c61ae14a35dac42825ccf455ba616d8e5c1f4fa8fdcf09807" already present on machine
10m         Normal    Started                        pod/hello-00001-deployment-64ddfc4766-zzrcb                           Started container queue-proxy
32s         Normal    ScalingReplicaSet              deployment/hello-00001-deployment                                     Scaled up replica set hello-00001-deployment-7c46db99df to 1
32s         Normal    SuccessfulDelete               replicaset/hello-00001-deployment-7c46db99df                          Deleted pod: hello-00001-deployment-7c46db99df-nxp9x
32s         Normal    SuccessfulCreate               replicaset/hello-00001-deployment-7c46db99df                          Created pod: hello-00001-deployment-7c46db99df-nxp9x
32s         Normal    ScalingReplicaSet              deployment/hello-00001-deployment                                     Scaled down replica set hello-00001-deployment-7c46db99df to 0 from 1
31s         Normal    Created                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Created container user-container
31s         Normal    Pulled                         pod/hello-00001-deployment-7c46db99df-nxp9x                           Container image "ghcr.io/knative/helloworld-go@sha256:d2af882a7ae7d49c61ae14a35dac42825ccf455ba616d8e5c1f4fa8fdcf09807" already present on machine
31s         Normal    Killing                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Stopping container user-container
31s         Normal    Pulled                         pod/hello-00001-deployment-7c46db99df-nxp9x                           Successfully pulled image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest" in 421.949534ms (421.969854ms including waiting)
31s         Normal    Started                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Started container user-container
31s         Normal    Pulling                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Pulling image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest"
31s         Normal    Created                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Created container queue-proxy
31s         Normal    Started                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Started container queue-proxy
31s         Normal    Killing                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Stopping container queue-proxy
2s          Warning   Unhealthy                      pod/hello-00001-deployment-7c46db99df-nxp9x                           Readiness probe failed: HTTP probe failed with statuscode: 503
2s          Warning   Unhealthy                      pod/hello-00001-deployment-7c46db99df-nxp9x                           Readiness probe failed: Get "http://10.245.142.234:8012/": dial tcp 10.245.142.234:8012: connect: connection refused

The first section (the events from 10m ago) is the startup of the knative service. What I do see here is this error:

10m         Warning   InternalError                  revision/hello-00001                                                  failed to update deployment "hello-00001-deployment": Operation cannot be fulfilled on deployments.apps "hello-00001-deployment": the object has been modified; please apply your changes to the latest version and try again

However, the service starts up fine and can be queried without issue. I'm not sure if this is relevant to this problem, but I also noticed this error in our KServe deployments.

The second part (32s and newer) is from the rolling restart. We see no errors, but we can see that it terminates the same pod it just created and keeps the old one running.
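
To dig further into both the startup conflict and, more importantly, who is issuing that scale-down, these are the checks I can think of (speculative on my part; I have not confirmed that the autoscaler is the culprit):

# watch both ReplicaSets of the revision during the restart
$ kubectl get rs -n <namespace> -l serving.knative.dev/revision=hello-00001 -w

# what scale does the Knative autoscaler want for this revision?
$ kubectl get podautoscaler -n <namespace> hello-00001
$ kubectl logs -n knative-serving deploy/autoscaler | grep hello-00001

# did the revision reconciler rewrite the Deployment (and its replica count)?
$ kubectl logs -n knative-serving deploy/controller | grep hello-00001-deployment
$ kubectl get deploy -n <namespace> hello-00001-deployment -o jsonpath='{.spec.replicas}'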

3. Run a basic Kubernetes deployment

To rule out that it is a k8s-specific issue, I just created a basic nginx deployment and did a rolling restart on it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

This deployment actually does restart the pod as expected (see the strategy check after the list):

  1. Creates new pod
  2. Waits until ready
  3. Terminates old pod
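
This is simply the Deployment's default RollingUpdate strategy at work, which can be confirmed with something like:

$ kubectl get deploy -n <namespace> nginx-deployment -o jsonpath='{.spec.strategy}'
# expect roughly: {"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"}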

The event log for this deployment looks like:

2m4s        Normal    ScalingReplicaSet              deployment/nginx-deployment                                           Scaled up replica set nginx-deployment-7c68c5c8dc to 1
2m3s        Normal    Created                        pod/nginx-deployment-7c68c5c8dc-tbjcm                                 Created container nginx
2m3s        Normal    SuccessfulCreate               replicaset/nginx-deployment-7c68c5c8dc                                Created pod: nginx-deployment-7c68c5c8dc-tbjcm
2m3s        Normal    Started                        pod/nginx-deployment-7c68c5c8dc-tbjcm                                 Started container nginx
2m3s        Normal    Pulled                         pod/nginx-deployment-7c68c5c8dc-tbjcm                                 Container image "nginx:1.14.2" already present on machine
11s         Normal    ScalingReplicaSet              deployment/nginx-deployment                                           Scaled up replica set nginx-deployment-68fbb8c788 to 1
11s         Normal    Pulled                         pod/nginx-deployment-68fbb8c788-gw4cn                                 Container image "nginx:1.14.2" already present on machine
11s         Normal    SuccessfulCreate               replicaset/nginx-deployment-68fbb8c788                                Created pod: nginx-deployment-68fbb8c788-gw4cn
11s         Normal    Created                        pod/nginx-deployment-68fbb8c788-gw4cn                                 Created container nginx
10s         Normal    Started                        pod/nginx-deployment-68fbb8c788-gw4cn                                 Started container nginx
9s          Normal    ScalingReplicaSet              deployment/nginx-deployment                                           Scaled down replica set nginx-deployment-7c68c5c8dc to 0 from 1
9s          Normal    SuccessfulDelete               replicaset/nginx-deployment-7c68c5c8dc                                Deleted pod: nginx-deployment-7c68c5c8dc-tbjcm
9s          Normal    Killing                        pod/nginx-deployment-7c68c5c8dc-tbjcm                                 Stopping container nginx
8s          Normal    SuccessfulDelete               replicaset/nginx-deployment-7c68c5c8dc                                Deleted pod: nginx-deployment-7c68c5c8dc-tbjcm

This makes me think the issue is with our KNative Serving setup.

  1. How can we further debug this issue?
  2. Why is there an internal error when first starting a knative service?
  3. How can I figure out why the new pod gets terminated within seconds?

gerritvd • Dec 04 '23 22:12

The state of the hello world deployment right after starting it is as follows:

$ kubectl get all -n namespace
NAME                                           READY   STATUS    RESTARTS   AGE
pod/hello-00001-deployment-86b4d7554-8fr9k     3/3     Running   0          64s
pod/ml-pipeline-ui-artifact-6858df96c7-xzlzh   2/2     Running   0          5d8h
pod/nginx-deployment-68fbb8c788-gw4cn          1/1     Running   0          6h46m

NAME                          TYPE           CLUSTER-IP    EXTERNAL-IP                                          PORT(S)                                              AGE
service/hello                 ExternalName   <none>        knative-local-gateway.istio-system.svc.foobar.net   80/TCP                                               60s
service/hello-00001           ClusterIP      10.1.9.188    <none>                                               80/TCP,443/TCP                                       64s
service/hello-00001-private   ClusterIP      10.1.255.26   <none>                                               80/TCP,443/TCP,9090/TCP,9091/TCP,8022/TCP,8012/TCP   64s

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/hello-00001-deployment    1/1     1            1           64s

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/hello-00001-deployment-86b4d7554     1         1         1       64s

NAME                                      LATESTCREATED   LATESTREADY   READY   REASON
configuration.serving.knative.dev/hello   hello-00001     hello-00001   True    

NAME                                       CONFIG NAME   K8S SERVICE NAME   GENERATION   READY   REASON   ACTUAL REPLICAS   DESIRED REPLICAS
revision.serving.knative.dev/hello-00001   hello                            1            True             1                 0

NAME                              URL                                      READY   REASON
route.serving.knative.dev/hello   http://hello.namespace.svc.cluster.local   True    

NAME                                URL                                      LATESTCREATED   LATESTREADY   READY   REASON
service.serving.knative.dev/hello   http://hello.namespace.svc.cluster.local   hello-00001     hello-00001   True    
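
One thing that stands out to me in this output is the revision reporting ACTUAL REPLICAS 1 but DESIRED REPLICAS 0, so the autoscaler may already want to scale this revision to zero. The internal scaling resources can be listed with (guessing that this is where to look):

$ kubectl get podautoscaler,sks -n <namespace>
# the PodAutoscaler shows the scale the autoscaler computed for hello-00001;
# the ServerlessService (sks) shows whether traffic goes through the activator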

gerritvd • Dec 05 '23 01:12

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] • Mar 05 '24 01:03

/remove-lifecycle stale

Commenting because I'm running into this problem as well with the same steps (kubectl rollout restart)

  • Knative Serving version 1.10.0

I've searched both the docs and the existing issues, and I thought it was related to the readiness/liveness probes, but configuring those didn't help either.

Is there perhaps any guidance on best practices for restarting a deployment without deleting a pod? What I've found to be the most reliable way is just deploying a new Knative Serving revision 😅
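
For anyone else hitting this: one way to force a new revision without changing the image is to bump an annotation on the revision template, analogous to what kubectl rollout restart does to a plain Deployment's pod template. Untested sketch; the annotation key and value are arbitrary placeholders:

$ kubectl patch ksvc <service-name> -n <namespace> --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"restarted-at":"2024-05-14T21:00:00Z"}}}}}'
# any change to spec.template should stamp out a new revision; its pods come up
# and traffic shifts over before the old revision is scaled down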

chrisyxlee • May 14 '24 21:05

/reopen

(sorry forgot to in the previous issue)

chrisyxlee • May 14 '24 21:05

@chrisyxlee: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

(sorry forgot to in the previous issue)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow[bot] • May 14 '24 21:05

What version of Knative are you using?

dprotaso • May 17 '24 17:05

What version of Knative are you using?

@dprotaso the original reporter wrote in the description:

We are running KServe 0.10.0 on top of KNative 1.8.1 with istio 1.13.3, on kubernetes 1.27

I'm also seeing this problem with the following versions:

  • KNative Serving 1.10.6
  • KServe 0.8
  • Istio 1.17
  • Kubernetes 1.26

sel-vcc • Jul 10 '24 14:07