KPA scale-down killing a random pod
When pods are working hard, a scale-down of a Service should not send kill signals to those pods, as doing so causes unnecessary setup/teardown overhead and puts extra pressure on the system.
Proposal: set PodDeletionCost on pods currently handling requests.
The controller.kubernetes.io/pod-deletion-cost annotation has been available in beta since Kubernetes 1.22. Could Knative temporarily add it to pods that are currently handling requests? This would prevent long-running requests (e.g. a complex deep learning ensemble) from being killed by a ReplicaSet scale-down.
Originally posted by @ddelange in https://github.com/knative/serving/issues/3886#issuecomment-1158727239
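For illustration, a minimal sketch (not existing Knative code) of what setting that annotation could look like with client-go; the namespace and pod name are placeholders:

```go
// Hypothetical sketch: patch the beta controller.kubernetes.io/pod-deletion-cost
// annotation on a pod so the ReplicaSet controller prefers other pods on scale-down.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// setPodDeletionCost marks a pod as more (positive cost) or less
// (negative cost) expensive to delete relative to its siblings.
func setPodDeletionCost(ctx context.Context, kc kubernetes.Interface, ns, pod string, cost int) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`, cost))
	_, err := kc.CoreV1().Pods(ns).Patch(ctx, pod, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	kc := kubernetes.NewForConfigOrDie(cfg)
	// Example: protect a pod that is handling a long-running request.
	// Pod name is a placeholder.
	if err := setPodDeletionCost(context.Background(), kc, "default", "my-revision-deployment-abc123", 100); err != nil {
		panic(err)
	}
}
```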
So in theory this could be done by having the KPA patch the annotation during reconcile, but I wonder if this would hit the frequent update scenario mentioned as something to avoid in the docs...
Alternatively, there's also the async component which can help with handling long-running requests.
I think they mean constructs like propagating CPU metrics to the annotation at regular intervals. With a reasonable stable-window on the KPA, and only patching the annotation right before a scale-down would occur (as they suggest), I think the load on the kube API would be negligible?
So in theory this could be done by having the KPA patch the annotation during reconcile
Although I have limited Go experience, @psschwei if you can point me to the relevant code I'd be happy to give this a shot? Or does this need more discussion/maintainer contribution?
Sorry for not responding sooner, I think I must've missed the notification for this one...
I think the biggest issue here would be the performance: if it doesn't impact performance too much, then I don't see too much of an issue (that's a big if though). For safety, we'd want to put it behind a feature gate (so that it'd be opt-in initially).
My initial thought was that this could be done in the KPA reconcile loop, specifically when the scaler makes its scaling decision. But my second thought was that doing it during SKS reconciliation might make more sense. Hopefully that helps a bit with getting started...
cc @nader-ziada @dprotaso in case they have other thoughts / comments
News from the k8s front: https://github.com/kubernetes/kubernetes/issues/107598#issuecomment-1210060884
Hi! 👋 Is there a way to get some additional eyes on the proposal? I'd love to learn about other ideas as well!
Might be worth opening a thread on Slack in the #serving-api channel with a link to this issue.
Conceptually I think it's fine to use PodDeletionCost, but it's not clear to me from the proposal here when/how this annotation would be updated. Is it based on the requests a pod is handling? What is the update cycle?
great questions, definitely food for thought!
does KPA know rps/concurrency per pod?
if yes, flow could look like:
- KPA has determined 3 pods will go down
- KPA sets a negative PodDeletionCost on the 3 pods with lowest rps/concurrency
- KPA initiates scale-down, those 3 pods are killed
that way there would be minimal annotation writes, and there's no need for cleanup of 'stale annotations'
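To make the idea concrete, a rough sketch of that flow, assuming the KPA had a per-pod concurrency map (which, as discussed below, it currently does not); everything here is hypothetical, not existing autoscaler code:

```go
// Hypothetical flow: before scaling down by n, mark the n least-loaded
// pods as the cheapest to delete, so the ReplicaSet controller picks them.
package scaledown

import (
	"context"
	"fmt"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// markScaleDownCandidates gives the n least-loaded pods a negative
// deletion cost. concurrencyByPod is an assumption: the KPA only
// samples a subset of pods today, so this map is not currently available.
func markScaleDownCandidates(ctx context.Context, kc kubernetes.Interface, ns string,
	concurrencyByPod map[string]float64, n int) error {
	type podLoad struct {
		name string
		load float64
	}
	pods := make([]podLoad, 0, len(concurrencyByPod))
	for name, load := range concurrencyByPod {
		pods = append(pods, podLoad{name, load})
	}
	// Least-loaded pods first.
	sort.Slice(pods, func(i, j int) bool { return pods[i].load < pods[j].load })

	// Negative cost => preferred for deletion over the default cost of 0.
	patch := []byte(`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"-1"}}}`)
	for i := 0; i < n && i < len(pods); i++ {
		if _, err := kc.CoreV1().Pods(ns).Patch(ctx, pods[i].name,
			types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return fmt.Errorf("marking %s: %w", pods[i].name, err)
		}
	}
	return nil
}
```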
Hi @nader-ziada and @psschwei 👋
Have you given this some thought? Did my comment clarify things, or did it raise more questions? Is it technically possible and, if so, would that be an acceptable implementation approach?
I think this feature would bring a lot of value both in terms of cost savings and serving stability. I would love to see this get done, and can definitely also help get this implemented!
does KPA know rps/concurrency per pod?
Not exactly -- we scrape metrics from a subset of pods and then return the average across those pods (see here for an example).
in that case, any takes on the next best path of least resistance?
Nothing jumps out, unfortunately...
News from the k8s front: kubernetes/kubernetes#107598 (comment)
So do I understand correctly: when the general-purpose feature goes GA in Kubernetes, the KPA could exploit it relatively easily? And until then it's relatively hard/time-consuming?
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
/lifecycle frozen /triage accepted
It looks like there's a Kubernetes API to perform eviction since 1.22+ (link) - it would be worth exploring the possibility of the autoscaler using this API prior to manipulating the Deployment's replica count
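For reference, a minimal sketch of what calling that Eviction API could look like (placeholder names, not existing autoscaler code); sequencing this with the replica-count change, so the ReplicaSet doesn't simply recreate the evicted pod, would still need to be worked out:

```go
// Hypothetical sketch: evict a specific (idle) pod via the policy/v1
// Eviction API, which the API server serves since Kubernetes 1.22.
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func evictPod(ctx context.Context, kc kubernetes.Interface, ns, pod string) error {
	return kc.CoreV1().Pods(ns).EvictV1(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod, Namespace: ns},
	})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	kc := kubernetes.NewForConfigOrDie(cfg)
	// Pod name is a placeholder for whichever pod the autoscaler picks.
	if err := evictPod(context.Background(), kc, "default", "my-revision-deployment-abc123"); err != nil {
		panic(err)
	}
}
```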
does KPA know rps/concurrency per pod?
Not exactly -- we scrape metrics from a subset of pods and then return the average across those pods (see here for an example).
Hi @dprotaso :wave:
What would be the advantage of creating an Eviction for a specific Pod before Deployment scale down, versus setting a PodDeletionCost on a specific Pod before Deployment scale down?
Don't both options have the same prerequisites?
This first and foremost requires scanning all the pods to find the ones to scale down, which is pretty expensive; hence even for metric sampling we do not do that.
@vagababov thanks for your reply, that clears up how it currently works! On the relevant upstream issue, the consensus seems to be that information on ALL pods will have to be acquired by an autoscaler system. I guess it just has to be cheap enough... If you have thoughts on that topic (the past few messages in that thread), I would appreciate you adding them there too! Especially as that is the proposed mechanic for all autoscalers, not just the vanilla HPA.
This first and foremost requires scanning all the pods to find the ones to scale down, which is pretty expensive; hence even for metric sampling we do not do that.
I was thinking since we're sampling we could look at the recent sample and identify a potential pod to scale down.
does the knative queue-proxy sidecar have knowledge of any metric that could be used as a value for the Pod's controller.kubernetes.io/pod-deletion-cost annotation?
if yes, would it be an anti-pattern if the sidecar gets permission to patch its own Pod annotation periodically? I guess that's what the downward API is for?
- it would instantly solve the problem, especially for long-running tasks.
- it could be behind a feature flag, plus a user-defined interval somewhere between 10s and 24h or so.
- it would be up to the user to find an interval which doesn't impact API server health, with a sensible default.
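As a strawman, a sketch of what such a self-patching sidecar loop could look like; POD_NAME/POD_NAMESPACE via the downward API and inFlightRequests() are assumptions, not existing queue-proxy code:

```go
// Hypothetical sketch of a queue-proxy-like sidecar periodically patching
// its own pod's deletion cost based on how busy it is.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// inFlightRequests is a placeholder for the sidecar's own request counter.
func inFlightRequests() int { return 0 }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	kc := kubernetes.NewForConfigOrDie(cfg)
	// Assumed to be injected into the container via the downward API.
	ns, pod := os.Getenv("POD_NAMESPACE"), os.Getenv("POD_NAME")

	// User-defined interval, e.g. behind a feature flag with a sensible default.
	for range time.Tick(30 * time.Second) {
		// Busy pods advertise a higher deletion cost than idle ones.
		patch := []byte(fmt.Sprintf(
			`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`,
			inFlightRequests()))
		if _, err := kc.CoreV1().Pods(ns).Patch(
			context.Background(), pod, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			fmt.Fprintln(os.Stderr, "failed to patch pod-deletion-cost:", err)
		}
	}
}
```

This would also require RBAC allowing the sidecar to patch its own Pod object.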
does the knative queue-proxy sidecar have knowledge of any metric that could be used as a value for the Pod's controller.kubernetes.io/pod-deletion-cost annotation?
hi :wave: just checking if anyone subscribed here has some pointers to move this forward?
does the knative queue-proxy sidecar have knowledge of any metric that could be used as a value for the Pod's controller.kubernetes.io/pod-deletion-cost annotation?
if yes, would it be an anti-pattern if the sidecar gets permission to patch its own Pod annotation periodically? I guess that's what the downward API is for?
- it would instantly solve the problem, especially for long-running tasks.
- it could be behind a feature flag, plus a user-defined interval somewhere between 10s and 24h or so.
- it would be up to the user to find an interval which doesn't impact API server health, with a sensible default.
@ddelange Sounds like a good idea, but I have a question about periodically patching pod-deletion-cost.
It is not synchronized with the autoscaler, right? So even if pods patch themselves periodically, the autoscaler could scale down pods that have not yet updated their status.
And if the patching interval is set too small, could it slow down Kubernetes if pods patch themselves frequently?
What about relative patching - the pod patches itself e.g. ~1 second after going inactive, when the KPA and autoscaling.knative.dev/metric: "concurrency" are used?
Hoping for this feature to be implemented soon ;) 🙂
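A sketch of what that relative patching could look like inside a hypothetical sidecar: only patch on busy/idle transitions, and debounce the idle patch by ~1 second, which keeps API-server writes low. markBusy/markIdle would be wired into the request path, and patchCost wraps an annotation patch like the one sketched earlier; none of this is existing code.

```go
// Hypothetical transition-based updater: patch the deletion cost when a
// pod becomes busy, and again ~1s after it has stayed idle.
package deletioncost

import (
	"sync"
	"time"
)

type deletionCostUpdater struct {
	mu        sync.Mutex
	busy      bool
	idleTimer *time.Timer
	patchCost func(cost int) // e.g. wraps the pod-deletion-cost annotation patch
}

// markBusy is called when a request starts: protect the pod right away,
// but only issue a patch on the idle->busy transition to limit API writes.
func (u *deletionCostUpdater) markBusy() {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.idleTimer != nil {
		u.idleTimer.Stop()
		u.idleTimer = nil
	}
	if !u.busy {
		u.busy = true
		u.patchCost(100)
	}
}

// markIdle is called when the last in-flight request finishes: after
// ~1s of continued inactivity, advertise the pod as cheap to delete.
func (u *deletionCostUpdater) markIdle() {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.busy = false
	u.idleTimer = time.AfterFunc(time.Second, func() { u.patchCost(-1) })
}
```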
Hello @gunpuz @ddelange and others,
In my use case, I have long-running jobs (~1 hour) and each pod can only handle one of them. In this scenario, killing the wrong pod is fatal, since it kills a running operation.
At least for this use case, I think it would be very useful to use the existing proxy sidecar (which knows how many requests are in flight on that particular pod) to report this count in the signal to the autoscaler. The autoscaler could then patch the deletion-cost annotation with this information.
Thoughts?