scheduling with update from v1.2.0 to v1.4.0

Open regadas opened this issue 3 years ago • 9 comments

What happened:

After updating Volcano from 1.2.0 to 1.4.0, PodGroups stay in the Pending phase when there are not enough resources, and the cluster autoscaler is not triggered to provision more.

Did I miss something in the latest version?

What you expected to happen:

I was expecting it to work as in 1.2.0

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version: 1.4.0
  • Kubernetes version (use kubectl version): 1.22.2
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): Container OS
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

regadas avatar Oct 08 '21 14:10 regadas

/assign @Thor-wl

Thor-wl avatar Oct 19 '21 01:10 Thor-wl

Well, please give more details about your testing steps so that I can reproduce it. Thanks.

Thor-wl avatar Nov 01 '21 01:11 Thor-wl

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Jan 30 '22 03:01 stale[bot]

Hi @Thor-wl. I am able to reproduce this issue using a GKE Kubernetes cluster with autoscaling enabled. Creating a podgroup that can't be satisfied with current resources is enough. Prior to v1.4.0, scaleUp is triggered which can be seen in events. After v1.4.0, this event doesn't happen.

Not sure how to reproduce it locally, but we have investigated it further on our side. It happens after this PR. With this change, Volcano started putting custom reasons like Undetermined into the reason field of a pod's status.conditions entries. The Kubernetes Cluster Autoscaler uses the same field to detect ScaleUp needs, but it only checks for Unschedulable. The related piece of code in the Cluster Autoscaler can be seen here.
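To illustrate the mismatch, here is a minimal Python sketch of the kind of check described above. It is a simplified model, not the actual Cluster Autoscaler code: the function names `needs_scale_up` and `pending_pod` are hypothetical, and the pod is represented as a plain dict shaped like `pod.status.conditions`.

```python
# Simplified model of an autoscaler-style check; NOT the real Cluster Autoscaler code.
# needs_scale_up and pending_pod are hypothetical helper names used for illustration.

UNSCHEDULABLE = "Unschedulable"


def needs_scale_up(pod: dict) -> bool:
    """Return True if the pod has a PodScheduled=False condition
    whose reason is exactly "Unschedulable"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "PodScheduled"
        and c.get("status") == "False"
        and c.get("reason") == UNSCHEDULABLE
        for c in conditions
    )


def pending_pod(reason: str) -> dict:
    """Build a minimal pending pod whose PodScheduled condition carries `reason`."""
    return {
        "status": {
            "phase": "Pending",
            "conditions": [
                {"type": "PodScheduled", "status": "False", "reason": reason}
            ],
        }
    }


# A pod marked Unschedulable is picked up; one marked Undetermined is ignored.
print(needs_scale_up(pending_pod("Unschedulable")))
print(needs_scale_up(pending_pod("Undetermined")))
```

Under these assumptions, any reason other than Unschedulable makes the pod invisible to the scale-up check, which matches the behavior we observe.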

I tested it by reverting the PR on top of v1.5.0-beta and autoscaling worked as before.

I'd appreciate any help on solving this in Volcano. Both autoscaling and batch scheduling are important to our setup.

yolgun avatar Feb 15 '22 12:02 yolgun

Hello, we are having the same issue and would appreciate an update.

fadi-artera avatar Mar 22 '22 01:03 fadi-artera

Thanks, guys. Let me take a look at that.

Thor-wl avatar Mar 22 '22 01:03 Thor-wl

Could we re-open and make an update here? Volcano is pretty much unusable with Cluster Autoscaler and Karpenter because of the "Undetermined" reason. Is there any reason why we shouldn't revert the PR to regain compatibility with the autoscaling/cloud ecosystem? Would love to hear from the team on this.

brickyard avatar May 06 '22 22:05 brickyard

@brickyard Of course, please update here. Maybe the scheduling reason enhancement in pr#1672 failed to consider the interaction between the scheduler and the autoscaler. @Thor-wl please continue working on a fix. If there is no way to keep both autoscaling and the scheduling reason enhancement, we need to revert first to preserve compatibility.

william-wang avatar May 07 '22 01:05 william-wang

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Aug 10 '22 03:08 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] avatar Oct 14 '22 03:10 stale[bot]

This issue is still affecting Karpenter users. Can we re-open and find a way to set the pod status to Unschedulable instead of "Undetermined"? Is there a reason it should be (or need to be) "Undetermined"?

tgaddair avatar Dec 11 '22 02:12 tgaddair

I think there should be no issues with reverting #1672, as the intent was to provide more information to the user. But if it breaks compatibility with cluster autoscalers, that seems like a very steep price to pay for better logging. Maybe this PR could be re-submitted by just annotating the status message with this info, instead of changing the Unschedulable reason?
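As a rough sketch of that idea (in Python for brevity, not Volcano's actual Go code), the condition could keep `Unschedulable` as its reason and carry the extra detail in the message instead. The helper name `scheduling_condition` and the message format are assumptions made for illustration only.

```python
# Hypothetical sketch: keep the reason autoscalers look for, move detail into the message.
# scheduling_condition and the message layout are illustrative, not Volcano's API.

def scheduling_condition(detail: str, explanation: str) -> dict:
    """Build a PodScheduled=False condition whose reason stays "Unschedulable"
    while the finer-grained detail is carried in the message."""
    return {
        "type": "PodScheduled",
        "status": "False",
        "reason": "Unschedulable",
        "message": f"{detail}: {explanation}",
    }


cond = scheduling_condition("Undetermined", "not enough resources for the pod group")
print(cond["reason"])
print(cond["message"])
```

With a condition shaped like this, tools that only match on the reason would still see the pod as unschedulable, while users reading the message would keep the additional information.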

tgaddair avatar Dec 11 '22 03:12 tgaddair