KPA scale-down killing a random pod
When pods are working hard, a scale-down of a Service should not send kill signals to those pods, as doing so causes unnecessary setup/teardown overhead and puts extra pressure on the system.
Proposal: set PodDeletionCost on pods currently handling requests.
The controller.kubernetes.io/pod-deletion-cost annotation has been available in beta since Kubernetes 1.22. Could Knative temporarily add it to pods that are currently handling requests? This would prevent long-running requests (e.g. a complex deep learning ensemble) from being killed by a ReplicaSet scale-down.
Originally posted by @ddelange in https://github.com/knative/serving/issues/3886#issuecomment-1158727239
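For illustration, a minimal sketch (not existing Knative code) of what setting that annotation could look like with client-go; the namespace and pod name are placeholders:

```go
// Hypothetical sketch: patch the beta controller.kubernetes.io/pod-deletion-cost
// annotation on a pod so the ReplicaSet controller prefers other pods on scale-down.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// setPodDeletionCost marks a pod as more (positive cost) or less
// (negative cost) expensive to delete relative to its siblings.
func setPodDeletionCost(ctx context.Context, kc kubernetes.Interface, ns, pod string, cost int) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`, cost))
	_, err := kc.CoreV1().Pods(ns).Patch(ctx, pod, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	kc := kubernetes.NewForConfigOrDie(cfg)
	// Example: protect a pod that is handling a long-running request.
	// Pod name is a placeholder.
	if err := setPodDeletionCost(context.Background(), kc, "default", "my-revision-deployment-abc123", 100); err != nil {
		panic(err)
	}
}
```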
So in theory this could be done by having the KPA patch the annotation during reconcile, but I wonder if this would hit the frequent update scenario mentioned as something to avoid in the docs...
Alternatively, there's also the async component which can help with handling long-running requests.
I think they mean constructs like propagating CPU metrics to the annotation at regular intervals. With a reasonable stable-window on the KPA, and only patching the annotation right before a scale-down would occur (as they suggest), I think the load on the kube API would be negligible?
So in theory this could be done by having the KPA patch the annotation during reconcile
Although I have limited Go experience, @psschwei if you can point me to the relevant code I'd be happy to give this a shot? Or does this need more discussion/maintainer contribution?
Sorry for not responding sooner, I think I must've missed the notification for this one...
I think the biggest issue here would be the performance: if it doesn't impact performance too much, then I don't see too much of an issue (that's a big if though). For safety, we'd want to put it behind a feature gate (so that it'd be opt-in initially).
My initial thought was that this could be done in the KPA reconcile loop, specifically when the scaler makes its scaling decision. But my second thought was that doing it during SKS reconciliation might make more sense. Hopefully that helps a bit with getting started...
cc @nader-ziada @dprotaso in case they have other thoughts / comments
News from the k8s front: https://github.com/kubernetes/kubernetes/issues/107598#issuecomment-1210060884
Hi! 👋 Is there a way to get some additional eyes on the proposal? I'd love to learn about other ideas as well!
Might be worth opening a thread on Slack in the #serving-api channel with a link to this issue.
Conceptually I think it's fine to use PodDeletionCost, but it's not clear to me from the proposal here when/how this annotation would be updated. Is it based on the requests a pod is handling? What is the update cycle?
great questions, definitely food for thought!
does KPA know rps/concurrency per pod?
if yes, flow could look like:
- KPA has determined 3 pods will go down
- KPA sets a negative PodDeletionCost on the 3 pods with lowest rps/concurrency
- KPA initiates scale-down, those 3 pods are killed
that way there would be minimal annotation writes, and there's no need for cleanup of 'stale annotations'
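To make the idea concrete, a rough sketch of that flow, assuming the KPA had a per-pod concurrency map (which, as discussed below, it currently does not); everything here is hypothetical, not existing autoscaler code:

```go
// Hypothetical flow: before scaling down by n, mark the n least-loaded
// pods as the cheapest to delete, so the ReplicaSet controller picks them.
package scaledown

import (
	"context"
	"fmt"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// markScaleDownCandidates gives the n least-loaded pods a negative
// deletion cost. concurrencyByPod is an assumption: the KPA only
// samples a subset of pods today, so this map is not currently available.
func markScaleDownCandidates(ctx context.Context, kc kubernetes.Interface, ns string,
	concurrencyByPod map[string]float64, n int) error {
	type podLoad struct {
		name string
		load float64
	}
	pods := make([]podLoad, 0, len(concurrencyByPod))
	for name, load := range concurrencyByPod {
		pods = append(pods, podLoad{name, load})
	}
	// Least-loaded pods first.
	sort.Slice(pods, func(i, j int) bool { return pods[i].load < pods[j].load })

	// Negative cost => preferred for deletion over the default cost of 0.
	patch := []byte(`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"-1"}}}`)
	for i := 0; i < n && i < len(pods); i++ {
		if _, err := kc.CoreV1().Pods(ns).Patch(ctx, pods[i].name,
			types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return fmt.Errorf("marking %s: %w", pods[i].name, err)
		}
	}
	return nil
}
```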
Hi @nader-ziada and @psschwei 👋
Have you given this some thought? Did my comment clarify things, or did it raise more questions? Is it technically possible and, if so, would that be an acceptable implementation approach?
I think this feature would bring a lot of value both in terms of cost savings and serving stability. I would love to see this get done, and can definitely also help get this implemented!
does KPA know rps/concurrency per pod?
Not exactly -- we scrape metrics from a subset of pods and then return the average across those pods (see here for an example).
in that case, any takes on the next best path of least resistance?
Nothing jumps out, unfortunately...
News from the k8s front: kubernetes/kubernetes#107598 (comment)
So do I understand correctly: when the general-purpose feature goes GA in Kubernetes, the KPA could exploit it relatively easily? And until then it's relatively hard/time-consuming?
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
/lifecycle frozen /triage accepted
It looks like there's a Kubernetes API to perform eviction since 1.22+ (link) - it would be worth exploring the possibility of the autoscaler using this API prior to manipulating the Deployment's replica count
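For reference, a minimal sketch of what calling that Eviction API could look like (placeholder names, not existing autoscaler code); sequencing this with the replica-count change, so the ReplicaSet doesn't simply recreate the evicted pod, would still need to be worked out:

```go
// Hypothetical sketch: evict a specific (idle) pod via the policy/v1
// Eviction API, which the API server serves since Kubernetes 1.22.
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func evictPod(ctx context.Context, kc kubernetes.Interface, ns, pod string) error {
	return kc.CoreV1().Pods(ns).EvictV1(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod, Namespace: ns},
	})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	kc := kubernetes.NewForConfigOrDie(cfg)
	// Pod name is a placeholder for whichever pod the autoscaler picks.
	if err := evictPod(context.Background(), kc, "default", "my-revision-deployment-abc123"); err != nil {
		panic(err)
	}
}
```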
does KPA know rps/concurrency per pod?
Not exactly -- we scrape metrics from a subset of pods and then return the average across those pods (see here for an example).
Hi @dprotaso :wave:
What would be the advantage of creating an Eviction for a specific Pod before Deployment scale down, versus setting a PodDeletionCost on a specific Pod before Deployment scale down?
Don't both options have the same prerequisites?
This first and foremost requires scanning all the pods to find the ones to scale down, which is pretty expensive; hence even for metric sampling we do not do that.
@vagababov thanks for your reply, that clears up how it currently works! On the relevant upstream issue, the consensus seems to be that information on ALL pods will have to be acquired by an autoscaler system. I guess it just has to be cheap enough... If you have thoughts on that topic (the past few messages in that thread), I would appreciate you adding them there too! Especially as that is the proposed mechanic for all autoscalers, not just the vanilla HPA.
This first and foremost requires scanning all the pods to find the ones to scale down, which is pretty expensive; hence even for metric sampling we do not do that.
I was thinking since we're sampling we could look at the recent sample and identify a potential pod to scale down.
does the knative queue-proxy sidecar have knowledge of any metric that could be used as a value for the Pod's controller.kubernetes.io/pod-deletion-cost annotation?
if yes, would it be an anti-pattern if the sidecar gets permission to patch its own Pod annotation periodically? I guess that's what the downward API is for?
- it would instantly solve the problem, especially for long-running tasks.
- it could be behind a feature flag, plus a user-defined interval somewhere between 10s and 24h or so.
- it would be up to the user to find an interval which doesn't impact API server health, with a sensible default.
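As a strawman, a sketch of what such a self-patching sidecar loop could look like; POD_NAME/POD_NAMESPACE via the downward API and inFlightRequests() are assumptions, not existing queue-proxy code:

```go
// Hypothetical sketch of a queue-proxy-like sidecar periodically patching
// its own pod's deletion cost based on how busy it is.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// inFlightRequests is a placeholder for the sidecar's own request counter.
func inFlightRequests() int { return 0 }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	kc := kubernetes.NewForConfigOrDie(cfg)
	// Assumed to be injected into the container via the downward API.
	ns, pod := os.Getenv("POD_NAMESPACE"), os.Getenv("POD_NAME")

	// User-defined interval, e.g. behind a feature flag with a sensible default.
	for range time.Tick(30 * time.Second) {
		// Busy pods advertise a higher deletion cost than idle ones.
		patch := []byte(fmt.Sprintf(
			`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`,
			inFlightRequests()))
		if _, err := kc.CoreV1().Pods(ns).Patch(
			context.Background(), pod, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			fmt.Fprintln(os.Stderr, "failed to patch pod-deletion-cost:", err)
		}
	}
}
```

This would also require RBAC allowing the sidecar to patch its own Pod object.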
does the knative queue-proxy sidecar have knowledge of any metric that could be used as a value for the Pod's controller.kubernetes.io/pod-deletion-cost annotation?
hi :wave: just checking if anyone subscribed here has some pointers to move this forward?
does the knative queue-proxy sidecar have knowledge of any metric that could be used as a value for the Pod's controller.kubernetes.io/pod-deletion-cost annotation?
if yes, would it be an anti-pattern if the sidecar gets permission to patch its own Pod annotation periodically? I guess that's what the downward API is for?
- it would instantly solve the problem, especially for long-running tasks.
- it could be behind a feature flag, plus a user-defined interval somewhere between 10s and 24h or so.
- it would be up to the user to find an interval which doesn't impact API server health, with a sensible default.
@ddelange Sounds like a good idea, but I have a question about periodically patching pod-deletion-cost.
It is not synchronized with the autoscaler, right? So even if pods patch themselves periodically, the autoscaler could scale down pods that have not yet updated their status.
And if the patching interval is set too small, could it slow down Kubernetes if pods patch themselves frequently?
What about relative patching - the pod patches itself e.g. ~1 second after going inactive, when the KPA and autoscaling.knative.dev/metric: "concurrency" are used?
Hoping for this feature to be implemented soon ;) 🙂
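A sketch of what that relative patching could look like inside a hypothetical sidecar: only patch on busy/idle transitions, and debounce the idle patch by ~1 second, which keeps API-server writes low. markBusy/markIdle would be wired into the request path, and patchCost wraps an annotation patch like the one sketched earlier; none of this is existing code.

```go
// Hypothetical transition-based updater: patch the deletion cost when a
// pod becomes busy, and again ~1s after it has stayed idle.
package deletioncost

import (
	"sync"
	"time"
)

type deletionCostUpdater struct {
	mu        sync.Mutex
	busy      bool
	idleTimer *time.Timer
	patchCost func(cost int) // e.g. wraps the pod-deletion-cost annotation patch
}

// markBusy is called when a request starts: protect the pod right away,
// but only issue a patch on the idle->busy transition to limit API writes.
func (u *deletionCostUpdater) markBusy() {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.idleTimer != nil {
		u.idleTimer.Stop()
		u.idleTimer = nil
	}
	if !u.busy {
		u.busy = true
		u.patchCost(100)
	}
}

// markIdle is called when the last in-flight request finishes: after
// ~1s of continued inactivity, advertise the pod as cheap to delete.
func (u *deletionCostUpdater) markIdle() {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.busy = false
	u.idleTimer = time.AfterFunc(time.Second, func() { u.patchCost(-1) })
}
```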
Hello @gunpuz @ddelange and others,
In my use case, I have long-running jobs (~1 hour) and each pod can only handle one of them. In this scenario, killing the wrong pod is fatal, since it kills a running operation.
At least for this use case, I think it would be very useful to use the existing proxy sidecar (which knows how many requests are in flight on that particular pod) to report this count in the signal to the autoscaler. The autoscaler could then patch the deletion-cost annotation with this information.
Thoughts?