Elaborate on the possibility of container termination completing before Endpoints reconciliation
This is a Feature Request. The section [1] should mention the possibility of all containers being terminated before the Pod has been removed from the associated Endpoints resources, as this could cause some failed connections until the Endpoints controller has completed reconciliation.
1 - https://github.com/kubernetes/website/blob/main/content/en/docs/concepts/workloads/pods/pod-lifecycle.md?plain=1#L415
What would you like to be added: An elaboration on possible flows, rather than just the example flow. Clarify the potential for container termination to complete before Endpoints reconciliation.
Why is this needed: To make Kubernetes users aware of a possible pitfall that causes minor network disruptions.
How I'd tackle this:
- Update https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced to mention that container failure can lead to forced Pod termination (for example, if the restart policy is Never).
- Also update https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced to mention that clients might try sending traffic to containers of the failed Pod, even as that terminated Pod is no longer in a position to accept the traffic.
- And mention that node-level traffic forwarding can cease as soon as the Pod is deleted from the API, separately from whether the container runtime has or hasn't shut down the actual containers.
/language en
Sounds like a good idea, Tim.
I also think that the content on graceful Pod deletion [1] should reflect these issues, which are caused by eventual consistency. When performing a regular rolling update, there is also a potential for opening connections to terminating Pods, as Kubernetes does not (from my understanding) enforce Endpoints updates to take place (and eventually propagate to the service proxy) before the kubelet starts SIGTERM-ing the containers of the terminating Pod. The kubelet and the Endpoints controller operate asynchronously, paying no attention to each other. If I'm wrong and an ordering is enforced, then I don't think it is reflected at the moment (also from [1]):
3. At the same time as the kubelet is starting graceful shutdown...
...
Pods that shut down slowly cannot continue to serve traffic as load balancers (like the service proxy) remove the Pod from the list of endpoints as soon as the termination grace period begins.
I've seen multiple articles [2] addressing this pitfall, and think it should be addressed or at least acknowledged by the official documentation.
1 - https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
2 - https://engineering.rakuten.today/post/graceful-k8s-delpoyments/
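To make the race concrete, here is a minimal sketch, assuming a plain Go HTTP server (the port, handler, and the 10-second delay are made-up illustrations, not anything prescribed by the docs): after receiving SIGTERM, the app keeps serving for a short period so that requests still routed through stale Endpoints / service-proxy state are not refused while reconciliation catches up, and only then shuts down gracefully.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	srv := &http.Server{Addr: ":8080", Handler: mux}

	go func() {
		stop := make(chan os.Signal, 1)
		signal.Notify(stop, syscall.SIGTERM)
		<-stop

		// Endpoints reconciliation and kube-proxy updates happen asynchronously
		// with respect to the kubelet sending SIGTERM; keep serving briefly so
		// late-routed requests still succeed. The duration is an assumption.
		time.Sleep(10 * time.Second)

		ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
		defer cancel()
		srv.Shutdown(ctx) // stop accepting new connections, drain in-flight ones
	}()

	srv.ListenAndServe()
}
```

The sleep-before-shutdown length is a guess at how long endpoint propagation takes in a given cluster; a preStop hook that sleeps can introduce a similar delay without changing the application.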
/triage accepted
/sig network
/priority backlog
https://github.com/kubernetes/kubernetes/pull/110191#issuecomment-1142294392 says:
From what I can find within the code, K8S has always assumed that a pod without a readiness probe stays in ready=true until it has fully terminated. This PR fixes the case where a pod is defined with a readiness probe, is being terminated, and needs to run the readiness probe on termination.
We should document that Pods that back Services should have a readiness probe defined, wherever this problematic behavior (sending traffic to terminating Pods) is not wanted. Also, the app code should try to fail the readiness probe as soon as Pod termination is signalled.
We should mention this in https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination and also update https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-readiness-probe to link there for details.
Optionally, also update https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/ to link to https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
Extra credit: add a diagram somewhere (might be easier once https://github.com/kubernetes/website/pull/36675 has merged)
The overall idea here is to probe Pods for readiness, even during shutdown, and use that readiness information to drop backends out of Services more gracefully.
Because unexpected failures can happen, workloads should also be prepared to handle the case where a Pod disappears from the API, or the container fails hard, even whilst traffic directed to that Pod is in flight.
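As an illustration of the suggestion above to fail readiness as soon as termination is signalled, here is a minimal Go sketch (the /readyz path and port 8080 are assumptions, not something the docs prescribe): the readiness handler starts returning 503 once SIGTERM arrives, while the main handler keeps answering in-flight and late-routed requests during the grace period.

```go
package main

import (
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

func main() {
	var terminating atomic.Bool

	// Flip to "not ready" as soon as termination is signalled, so the kubelet's
	// readiness probe fails and the Pod can be dropped from Service backends.
	go func() {
		stop := make(chan os.Signal, 1)
		signal.Notify(stop, syscall.SIGTERM)
		<-stop
		terminating.Store(true)
	}()

	// Assumed readiness endpoint; a readinessProbe in the Pod spec would point here.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if terminating.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok")) // real work keeps running during the grace period
	})

	http.ListenAndServe(":8080", nil)
}
```

How quickly traffic actually stops arriving after the probe fails still depends on Endpoints reconciliation and service-proxy propagation in the cluster, which is exactly the eventual-consistency caveat discussed above.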
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.