serving icon indicating copy to clipboard operation
serving copied to clipboard

Feature Request: Add terminationGracePeriodSeconds field in liveness probe

Open gane5hvarma opened this issue 9 months ago • 20 comments

/area API /kind good-first-issue /kind spec

Describe the feature

The kubernetes pod spec allows us to mention terminationGracePeriodSeconds under livenessProbe for a specific container. This helps restarting a failed container. Current the knative service spec ignores this field.

docs for reference: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#probe-level-terminationgraceperiodseconds

My use case: The liveness probe for the container fails and the container receives a sigterm. The container code went into a deadlock and doesnt respect gracefull shutdown. Now the container waits for pod terminationGracePeriodSeconds which is set by knative. This is taking around 300 second for the pod to restart. I dont want this to setting because of request timeout, but want the container to restart as soon as liveness probe fails.

gane5hvarma avatar Mar 31 '25 14:03 gane5hvarma

This seems like a duplicate of https://github.com/knative/serving/issues/15599

flomedja avatar Apr 07 '25 23:04 flomedja

Hi @FloMedja this is not a duplicate of #15599. 15599 talks about deleting deployment and it talks about terminationGracePeriodSeconds in pod spec. This issue is about adding terminationGracePeriodSeconds in livenessProbe. This is need if the container get sigterm, it should restart as soon as possible. The pod wont be terminated.

gane5hvarma avatar Apr 08 '25 07:04 gane5hvarma

My bad. I get it.

flomedja avatar Apr 08 '25 14:04 flomedja

@flomedja what do you think of this feature?

gane5hvarma avatar Apr 14 '25 07:04 gane5hvarma

@skonto @dprotaso Was there a specific reason for ignoring the terminationGracePeriodSeconds at the container level ?

flomedja avatar Apr 18 '25 01:04 flomedja

So Knative by default sets the terminationGracePeriod to the revision's timeoutseconds. This handles the scenario where a request comes in around the time the pod is being terminated - thus we want to handle that request and not fail.

The kubernetes pod spec allows us to mention terminationGracePeriodSeconds under livenessProbe for a specific container.

I'm reading the K8s docs and I really don't see the motivation for this feature. Do you have a KEP link handy?

The container code went into a deadlock and doesnt respect gracefull shutdown

Which container? the queue proxy or the user's app?

dprotaso avatar Apr 18 '25 01:04 dprotaso

Ok - I found the KEP and it explained the use

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2238-liveness-probe-grace-period#summary

Yeah I think this makes sense to support.

From the KEP the TerminationGracePeriod in the probe is used for abnormal termination when the liveness probes fail.

The top-level TerminationGracePeriod is for normal shutdown.

dprotaso avatar Apr 18 '25 01:04 dprotaso

I think it should be on the user's app (and sidecar if defined) and not the queue-proxy. Do you see a use case to make it configurable for the queue-proxy too ?

@gane5hvarma was that what you have in mind ?

flomedja avatar Apr 18 '25 02:04 flomedja

Do you see a use case to make it configurable for the queue-proxy too ?

I don't believe we set a liveness probe on the queue proxy so we can skip it here

dprotaso avatar Apr 18 '25 02:04 dprotaso

Sounds good to me, I can take care of the issue 🙂.

Just for my understanding, is there a specific reason why we don't set a liveness probe on the queue-proxy?

flomedja avatar Apr 18 '25 02:04 flomedja

@dprotaso from what I understand the #15599 is talking about the terminationGracePeriodSeconds on the pod level and this one on the container level in the livenessProbe.

flomedja avatar Apr 18 '25 02:04 flomedja

@flomedja yeah

dprotaso avatar Apr 18 '25 02:04 dprotaso

I think it should be on the user's app (and sidecar if defined) and not the queue-proxy. Do you see a use case to make it configurable for the queue-proxy too ?

@gane5hvarma was that what you have in mind ?

My use case is for the user's app and not the queue-proxy. I dont think its required for queue-proxy since it handles graceful termination

gane5hvarma avatar Apr 21 '25 04:04 gane5hvarma

@gane5hvarma can I work on this issue?

EraKin575 avatar May 04 '25 13:05 EraKin575

I am working on the issue :) . I Should have a PR ready soon :)

flomedja avatar May 04 '25 13:05 flomedja

@dprotaso I am pretty much ready for pushing my MR. I asked you a little question on CNCF slack about startup probes validation for sidecar containers . After getting an answer, I will be ready to push the MR.

flomedja avatar May 14 '25 22:05 flomedja

@flomedja are you still working on this issue? There is a commit referenced but not sure how complete it is.

I would otherwise assign this issue myself to finalize it for the current milestone (we already missed 1.19 and 1.20).

linkvt avatar Nov 13 '25 09:11 linkvt

@linkvt would it be okay if I take over? I want to contribute to knative and this was the only good-first-issue tagged issue there is.

changminbark avatar Nov 15 '25 21:11 changminbark

@linkvt Sorry for the delay. I was waiting for an information from @dprotaso about the task and get sidetracked by some work afterwards. I will have a MR ready during this week.

flomedja avatar Nov 15 '25 21:11 flomedja

@linkvt I pushed the MR. As requested by @dprotaso I add an issue https://github.com/knative/serving/issues/16256 for adding validation for startup probes.

flomedja avatar Nov 21 '25 21:11 flomedja