During rollout restart of a deployment the new pod gets terminated within seconds
What version of Knative?
1.8.1
Expected Behavior
I want to execute a rolling restart on a deployment such that it starts the new pod(s), waits until the pod(s) are running, and then terminates the old pod(s).
Actual Behavior
I noticed that when trying to do a rolling restart on a KServe or KNative deployment the new pod is immediately terminated and the old pod remains running.
Steps to Reproduce the Problem
We are running KServe 0.10.0 on top of KNative 1.8.1 with istio 1.13.3, on kubernetes 1.27
To narrow down where the issue might be I did the following:
1. Run a basic KServe example: sklearn-iris
This shows the behavior I described above. I won't go too much into detail here because the behavior shows up with a basic knative deployment as well, as described below.
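For reference, the sklearn-iris sample here is the standard KServe quick-start InferenceService (sketched from the upstream example, so treat the exact storageUri as an assumption):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"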
2. Run a basic KNative serving example: helloworld-go
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          ports:
            - containerPort: 8080
          env:
            - name: TARGET
              value: "World"
This starts a pod with 2 containers (a quick check is sketched after this list):
- queue-proxy
- user-container
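One way to confirm that container set (the label selector is assumed from the standard labels Knative adds to the pods it manages):
# List the container names in the hello service's pod(s).
$ kubectl get pods -n <namespace> -l serving.knative.dev/service=hello \
    -o jsonpath='{.items[*].spec.containers[*].name}'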
Then I do a rolling restart:
$ kubectl rollout restart deploy -n <namespace> hello-00001-deployment
and I witness the same behavior: the new pod gets terminated within seconds. Looking at the event timeline we see:
$ kubectl get events -n <namespace> --sort-by='.lastTimestamp'
10m Normal Created configuration/hello Created Revision "hello-00001"
10m Normal ScalingReplicaSet deployment/hello-00001-deployment Scaled up replica set hello-00001-deployment-64ddfc4766 to 1
10m Normal Created service/hello Created Route "hello"
10m Normal Created service/hello Created Configuration "hello"
10m Normal FinalizerUpdate route/hello Updated "hello" finalizers
10m Warning FinalizerUpdateFailed route/hello Failed to update finalizers for "hello": Operation cannot be fulfilled on routes.serving.knative.dev "hello": the object has been modified; please apply your changes to the latest version and try again
10m Warning InternalError revision/hello-00001 failed to update deployment "hello-00001-deployment": Operation cannot be fulfilled on deployments.apps "hello-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
10m Normal SuccessfulCreate replicaset/hello-00001-deployment-64ddfc4766 Created pod: hello-00001-deployment-64ddfc4766-zzrcb
10m Normal Pulled pod/hello-00001-deployment-64ddfc4766-zzrcb Successfully pulled image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest" in 343.203487ms (343.211177ms including waiting)
10m Normal Created pod/hello-00001-deployment-64ddfc4766-zzrcb Created container queue-proxy
10m Normal Pulling pod/hello-00001-deployment-64ddfc4766-zzrcb Pulling image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest"
10m Normal Started pod/hello-00001-deployment-64ddfc4766-zzrcb Started container user-container
10m Normal Created pod/hello-00001-deployment-64ddfc4766-zzrcb Created container user-container
10m Normal Pulled pod/hello-00001-deployment-64ddfc4766-zzrcb Container image "ghcr.io/knative/helloworld-go@sha256:d2af882a7ae7d49c61ae14a35dac42825ccf455ba616d8e5c1f4fa8fdcf09807" already present on machine
10m Normal Started pod/hello-00001-deployment-64ddfc4766-zzrcb Started container queue-proxy
32s Normal ScalingReplicaSet deployment/hello-00001-deployment Scaled up replica set hello-00001-deployment-7c46db99df to 1
32s Normal SuccessfulDelete replicaset/hello-00001-deployment-7c46db99df Deleted pod: hello-00001-deployment-7c46db99df-nxp9x
32s Normal SuccessfulCreate replicaset/hello-00001-deployment-7c46db99df Created pod: hello-00001-deployment-7c46db99df-nxp9x
32s Normal ScalingReplicaSet deployment/hello-00001-deployment Scaled down replica set hello-00001-deployment-7c46db99df to 0 from 1
31s Normal Created pod/hello-00001-deployment-7c46db99df-nxp9x Created container user-container
31s Normal Pulled pod/hello-00001-deployment-7c46db99df-nxp9x Container image "ghcr.io/knative/helloworld-go@sha256:d2af882a7ae7d49c61ae14a35dac42825ccf455ba616d8e5c1f4fa8fdcf09807" already present on machine
31s Normal Killing pod/hello-00001-deployment-7c46db99df-nxp9x Stopping container user-container
31s Normal Pulled pod/hello-00001-deployment-7c46db99df-nxp9x Successfully pulled image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest" in 421.949534ms (421.969854ms including waiting)
31s Normal Started pod/hello-00001-deployment-7c46db99df-nxp9x Started container user-container
31s Normal Pulling pod/hello-00001-deployment-7c46db99df-nxp9x Pulling image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest"
31s Normal Created pod/hello-00001-deployment-7c46db99df-nxp9x Created container queue-proxy
31s Normal Started pod/hello-00001-deployment-7c46db99df-nxp9x Started container queue-proxy
31s Normal Killing pod/hello-00001-deployment-7c46db99df-nxp9x Stopping container queue-proxy
2s Warning Unhealthy pod/hello-00001-deployment-7c46db99df-nxp9x Readiness probe failed: HTTP probe failed with statuscode: 503
2s Warning Unhealthy pod/hello-00001-deployment-7c46db99df-nxp9x Readiness probe failed: Get "http://10.245.142.234:8012/": dial tcp 10.245.142.234:8012: connect: connection refused
The first section is the startup of the knative service (10m). What I do see here is the error:
10m Warning InternalError revision/hello-00001 failed to update deployment "hello-00001-deployment": Operation cannot be fulfilled on deployments.apps "hello-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
However, the service starts up fine and can be queried without issue. I'm not sure whether this is relevant to this problem, but I also see the same error in our KServe deployments.
The second part (32s and less) happens when running the rolling restart. We see no errors but we can see that it terminates the same pod it created and keeps the old one running.
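One way to see which controller is fighting the rollout (a hedged sketch; the label selector assumes the standard serving.knative.dev/revision label that Knative puts on the resources it creates):
# Show both ReplicaSets of the revision's deployment and their desired/current replicas.
$ kubectl get replicasets -n <namespace> -l serving.knative.dev/revision=hello-00001
# The deployment's conditions and recent events from its own point of view.
$ kubectl describe deployment hello-00001-deployment -n <namespace>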
3. Run a basic Kubernetes deployment
To rule out that it is a Kubernetes-specific issue I just created a basic nginx deployment and did a rolling restart on it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80
This deployment actually does restart the pod as expected (a quick way to watch the rollout is sketched after this list):
- Creates new pod
- Waits until ready
- Terminates old pod
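A simple way to watch that sequence for the nginx deployment (plain kubectl, nothing Knative-specific):
# Trigger the restart and block until the new ReplicaSet is fully rolled out.
$ kubectl rollout restart deployment/nginx-deployment -n <namespace>
$ kubectl rollout status deployment/nginx-deployment -n <namespace>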
The event log for this deployment looks like:
2m4s Normal ScalingReplicaSet deployment/nginx-deployment Scaled up replica set nginx-deployment-7c68c5c8dc to 1
2m3s Normal Created pod/nginx-deployment-7c68c5c8dc-tbjcm Created container nginx
2m3s Normal SuccessfulCreate replicaset/nginx-deployment-7c68c5c8dc Created pod: nginx-deployment-7c68c5c8dc-tbjcm
2m3s Normal Started pod/nginx-deployment-7c68c5c8dc-tbjcm Started container nginx
2m3s Normal Pulled pod/nginx-deployment-7c68c5c8dc-tbjcm Container image "nginx:1.14.2" already present on machine
11s Normal ScalingReplicaSet deployment/nginx-deployment Scaled up replica set nginx-deployment-68fbb8c788 to 1
11s Normal Pulled pod/nginx-deployment-68fbb8c788-gw4cn Container image "nginx:1.14.2" already present on machine
11s Normal SuccessfulCreate replicaset/nginx-deployment-68fbb8c788 Created pod: nginx-deployment-68fbb8c788-gw4cn
11s Normal Created pod/nginx-deployment-68fbb8c788-gw4cn Created container nginx
10s Normal Started pod/nginx-deployment-68fbb8c788-gw4cn Started container nginx
9s Normal ScalingReplicaSet deployment/nginx-deployment Scaled down replica set nginx-deployment-7c68c5c8dc to 0 from 1
9s Normal SuccessfulDelete replicaset/nginx-deployment-7c68c5c8dc Deleted pod: nginx-deployment-7c68c5c8dc-tbjcm
9s Normal Killing pod/nginx-deployment-7c68c5c8dc-tbjcm Stopping container nginx
8s Normal SuccessfulDelete replicaset/nginx-deployment-7c68c5c8dc Deleted pod: nginx-deployment-7c68c5c8dc-tbjcm
This makes me think the issue is with our KNative Serving setup.
- How can we further debug this issue? (a few starting points are sketched after this list)
- Why is there an internal error when first starting a knative service?
- How can I figure out why the new pod gets terminated within seconds?
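A few starting points that might help narrow this down (a sketch only; it assumes a default install where the Knative control plane runs in the knative-serving namespace):
# Control-plane logs around the time of the restart: the controller reconciles the
# revision's deployment, the autoscaler decides the replica counts.
$ kubectl logs -n knative-serving deploy/controller --since=10m | grep -i hello
$ kubectl logs -n knative-serving deploy/autoscaler --since=10m | grep -i hello
# What the Knative PodAutoscaler currently wants for the revision.
$ kubectl get podautoscaler hello-00001 -n <namespace> -o yaml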
The state of the hello world deployment right after starting it is as follows:
$ kubectl get all -n namespace
NAME READY STATUS RESTARTS AGE
pod/hello-00001-deployment-86b4d7554-8fr9k 3/3 Running 0 64s
pod/ml-pipeline-ui-artifact-6858df96c7-xzlzh 2/2 Running 0 5d8h
pod/nginx-deployment-68fbb8c788-gw4cn 1/1 Running 0 6h46m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/hello ExternalName <none> knative-local-gateway.istio-system.svc.foobar.net 80/TCP 60s
service/hello-00001 ClusterIP 10.1.9.188 <none> 80/TCP,443/TCP 64s
service/hello-00001-private ClusterIP 10.1.255.26 <none> 80/TCP,443/TCP,9090/TCP,9091/TCP,8022/TCP,8012/TCP 64s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/hello-00001-deployment 1/1 1 1 64s
NAME DESIRED CURRENT READY AGE
replicaset.apps/hello-00001-deployment-86b4d7554 1 1 1 64s
NAME LATESTCREATED LATESTREADY READY REASON
configuration.serving.knative.dev/hello hello-00001 hello-00001 True
NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON ACTUAL REPLICAS DESIRED REPLICAS
revision.serving.knative.dev/hello-00001 hello 1 True 1 0
NAME URL READY REASON
route.serving.knative.dev/hello http://hello.namespace.svc.cluster.local True
NAME URL LATESTCREATED LATESTREADY READY REASON
service.serving.knative.dev/hello http://hello.namespace.svc.cluster.local hello-00001 hello-00001 True
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
Commenting because I'm running into this problem as well with the same steps (kubectl rollout restart)
- Knative Serving version 1.10.0
I've searched both the docs and the existing issues, and I thought it was related to the readiness/liveness probes, but configuring those didn't help either.
Is there perhaps any guidance on best practices for restarting a deployment without deleting a pod? What I've found to be the most reliable way is just deploying a new Knative Serving revision 😅
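A minimal sketch of that "new revision" workaround, assuming the hello Service from the reproduction above (the RESTART_MARKER env var is just an arbitrary marker I added; any change to the revision template should cause Knative to stamp out a new revision, bring it up, and only then shift traffic away from the old one):
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          ports:
            - containerPort: 8080
          env:
            - name: TARGET
              value: "World"
            # Bump this value to force a new revision instead of a rollout restart.
            - name: RESTART_MARKER
              value: "2024-01-01T00:00:00Z"
$ kubectl apply -f hello.yaml -n <namespace>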
/reopen
(sorry forgot to in the previous issue)
@chrisyxlee: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
(sorry forgot to in the previous issue)
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What version of Knative are you using?
@dprotaso the original reporter wrote in the description:
We are running KServe 0.10.0 on top of KNative 1.8.1 with istio 1.13.3, on kubernetes 1.27
I'm also seeing this problem with the following versions:
- KNative Serving 1.10.6
- KServe 0.8
- Istio 1.17
- Kubernetes 1.26