volcano
Add cooldown time support for preempt action
What would you like to be added:
Users can set a cool-down time for a preemptible job's pods by adding labels or annotations, so that pods are not preempted shortly after they start.
Why is this needed:
This is related to the elastic scheduler. When elastic training or serving is enabled, a preemptible job's pods can be preempted and returned to running repeatedly. If no cool-down time is set, these pods can be preempted again shortly after they start, which may reduce service stability.
Good idea. I think that is meaningful. Can you help give a design and implementation for that?
We already have a local implementation. It's not complex, so I'll simply summarize it here:
- Provide a new label/annotation named "volcano.sh/preempt_stable_time", whose value is the cool-down time for preemption, in seconds. This label/annotation can be set on an entire vcjob or on specific tasks; if it is set on the job, we propagate it to all tasks' pods.
- Add a plugin that participates in the preempt action and ensures that pods scheduled after `now - preempt_stable_time` are not included in the resulting victims list.
@Thor-wl Please take a look. If it's OK, I can submit a PR.
> Provide a new label/annotation named "volcano.sh/preempt_stable_time", whose value is the cool-down time for preemption, in seconds.

This should be handled within the scheduler's cache to avoid additional apiserver requests :)
IMO, it's better to have a plugin to order victims by start time; and have a min-start-time parameter for that :)
> This should be handled within scheduler's cache to avoid additional apiserver request :)

Thanks for the reminder! I'll take care of it.
> IMO, it's better to have a plugin to order victims by start time; and have a min-start-time parameter for that :)

I'm not sure what we can do with the ordered victims. Can you provide more details? Thanks :)
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗