Add cooldown time support for preempt action

Open flyhighzy opened this issue 2 years ago • 5 comments

What would you like to be added:

Users can set a cooldown time for a preemptible job's pods by adding labels or annotations, so that pods are not preempted shortly after they start.

Why is this needed:

This is related to the elastic scheduler. When elastic training or serving is enabled, a preemptible job's pods can be repeatedly preempted and brought back to running. If no cooldown time is set, these pods can be preempted again shortly after they start, which may reduce service stability.

flyhighzy avatar Mar 11 '22 10:03 flyhighzy

Good idea. I think that is meaningful. Could you provide a design and implementation for that?

Thor-wl avatar Mar 14 '22 03:03 Thor-wl

We already have a local implementation. It's not complex, so I'll simply summarize it here:

  1. Provide a new label/annotation named "volcano.sh/preempt_stable_time", whose value is the preemption cooldown time in seconds. This label/annotation can be set on the entire vcjob or on specific tasks; if set on the job, it is propagated to the pods of all tasks.
  2. Add a plugin that participates in the preempt action and ensures that pods scheduled after now - preempt_stable_time are not included in the resulting victims list.

@Thor-wl Please review. If it looks OK, I can submit a PR.

flyhighzy avatar Mar 18 '22 05:03 flyhighzy

Provide a new label/annotation named "volcano.sh/preempt_stable_time", whose value is the preemption cooldown time in seconds.

This should be handled within the scheduler's cache to avoid additional apiserver requests :)

k82cn avatar Mar 18 '22 06:03 k82cn

IMO, it's better to have a plugin to order victims by start time; and have a min-start-time parameter for that :)

k82cn avatar Mar 18 '22 06:03 k82cn

This should be handled within the scheduler's cache to avoid additional apiserver requests :)

Thanks for the reminder! I'll take care of it.

IMO, it's better to have a plugin to order victims by start time; and have a min-start-time parameter for that :)

I'm not sure what we can do with the ordered victims. Can you provide more details? Thanks :)

flyhighzy avatar Mar 18 '22 07:03 flyhighzy

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Oct 29 '22 10:10 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] avatar Dec 31 '22 21:12 stale[bot]