Scheduling issue after update from v1.2.0 to v1.4.0
What happened:
Volcano was updated from 1.2.0 to 1.4.0. With the newest version, if there are not enough resources, PodGroups are kept in the `Pending` phase and the cluster autoscaler is not triggered to provision more resources.
Did I miss something in the latest version?
What you expected to happen:
I was expecting it to work as it did in 1.2.0.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Volcano Version: 1.4.0
- Kubernetes version (use `kubectl version`): 1.22.2
- Cloud provider or hardware configuration: GKE
- OS (e.g. from /etc/os-release): Container OS
- Kernel (e.g. `uname -a`):
- Install tools:
- Others:
/assign @Thor-wl
Well, please give more details about your testing steps so that I can reproduce it. Thanks!
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Hi @Thor-wl. I am able to reproduce this issue using a GKE Kubernetes cluster with autoscaling enabled. Creating a PodGroup that can't be satisfied with the current resources is enough. Prior to v1.4.0, a `scaleUp` is triggered, which can be seen in the events. After v1.4.0, this event doesn't happen.
Not sure how to reproduce it locally, but we have investigated it further on our side. It happens after this PR. With it, Volcano has started putting custom reasons like `Undetermined` into a pod's `status.conditions.reason` field. The Kubernetes Cluster Autoscaler uses the same field to detect scale-up needs, but it only checks for `Unschedulable`. The related piece of code in the Cluster Autoscaler can be seen here.
I tested it by reverting the PR on top of v1.5.0-beta and autoscaling worked as before.
I'd appreciate any help with solving this in Volcano. Both autoscaling and batch scheduling are important to our setup.
Hello, we are having the same issue and would appreciate an update on it.
Thanks, guys. Let me take a look at that.
Could we re-open and make an update here? Volcano is pretty much unusable with Cluster Autoscaler and Karpenter with the "Undetermined" reason. Is there any reason why we shouldn't revert the PR to gain back compatibility with the autoscaling/cloud ecosystem? Would love to hear from the team on this.
@brickyard Of course, please update here. Maybe the scheduling-reason enhancement in PR #1672 failed to consider the interaction between the scheduler and the autoscaler. @Thor-wl, please continue to work on this and fix it. If there is no way to take care of both autoscaling and the scheduler-reason enhancement, we need to revert first to keep compatibility.
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗
This issue is still affecting Karpenter users. Can we re-open and find a way to set the pod status to `Unschedulable` instead of "Undetermined"? Is there a reason it should be (or needs to be) "Undetermined"?
I think there should be no issues with reverting #1672, as the intent was to provide more information to the user. But if it breaks compatibility with cluster autoscalers, that seems like a very steep price to pay for better logging. Maybe this PR could be re-submitted by just annotating the status message with this info, instead of changing the `Unschedulable` reason?
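For illustration, a rough Go sketch of that suggestion, assuming a hypothetical `newPendingCondition` helper (not a function from the actual Volcano code base): keep the autoscaler-compatible `Unschedulable` reason and carry the detailed explanation in the condition's `Message` instead.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newPendingCondition is a hypothetical helper used only to show the shape
// of the condition being proposed; it does not exist in Volcano today.
func newPendingCondition(detail string) corev1.PodCondition {
	return corev1.PodCondition{
		Type:               corev1.PodScheduled,
		Status:             corev1.ConditionFalse,
		LastTransitionTime: metav1.Now(),
		// Keep the standard reason so Cluster Autoscaler and Karpenter still react.
		Reason: corev1.PodReasonUnschedulable,
		// Put the richer, Volcano-specific explanation here instead.
		Message: detail,
	}
}

func main() {
	cond := newPendingCondition("pod group is not ready: only 3/5 tasks can be scheduled")
	fmt.Printf("%s=%s, reason=%s, message=%q\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```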