volcano
Add uninqueueable reason in podgroup condition
Please merge the API PR https://github.com/volcano-sh/apis/pull/113 first; after that I will update go.mod and refresh the last commit.
Add an un-inqueueable reason to the podgroup condition when a job is rejected at enqueue, so that describing the podgroup makes it clear why the job is pending.
This PR changes the podgroup's pending condition, when it is caused by insufficient queue quota, from:
message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable'
reason: NotEnoughResources
to
message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable, origin reason: queue resource quota insufficient'
reason: NotInqueueable
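For illustration only, the intended change to the condition can be sketched in Go as below. This is a minimal sketch and not the scheduler code in this PR: the `condition` type, the constant names, and `buildUnschedulableCondition` are hypothetical; only the reason strings and the message format come from the description above.

```go
package main

import "fmt"

// Reason strings as shown in the PR description; the constant names are hypothetical.
const (
	reasonNotEnoughResources = "NotEnoughResources"
	reasonNotInqueueable     = "NotInqueueable"
)

// condition is a simplified, hypothetical stand-in for a podgroup condition.
type condition struct {
	Reason  string
	Message string
}

// buildUnschedulableCondition chooses the reason and message for a pending podgroup.
// enqueueErr is the reason the job was rejected at enqueue, or "" if it was not rejected.
func buildUnschedulableCondition(gangMsg, enqueueErr string) condition {
	if enqueueErr == "" {
		// Not rejected at enqueue: keep the existing reason and message.
		return condition{Reason: reasonNotEnoughResources, Message: gangMsg}
	}
	// Rejected at enqueue: switch the reason and append the original error.
	return condition{
		Reason:  reasonNotInqueueable,
		Message: fmt.Sprintf("%s, origin reason: %s", gangMsg, enqueueErr),
	}
}

func main() {
	gang := "3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable"
	c := buildUnschedulableCondition(gang, "queue resource quota insufficient")
	fmt.Printf("reason: %s\nmessage: %s\n", c.Reason, c.Message)
}
```

Keeping the gang message as a prefix means existing tooling that matches on it keeps working, while the appended part carries the enqueue rejection.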
test result
origin
spec:
  minMember: 3
  minResources:
    count/pods: "3"
    cpu: 1100m
    memory: 200Mi
    pods: "3"
    requests.cpu: 1100m
    requests.memory: 200Mi
  minTaskMember:
    master: 2
    work: 1
  queue: test
status:
  conditions:
  - lastTransitionTime: "2023-08-12T06:27:12Z"
    message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: 35ec245e-f901-47a9-a2b1-ef0456505f86
    type: Unschedulable
  phase: Pending
➜ volcano git:(add_uninqueue_state) ✗ kubectl get events |grep minavailable-job-4d40d46d-8bab-4914-9ddc-7a8e2aeda95a
18s Normal Unschedulable podgroup/minavailable-job-4d40d46d-8bab-4914-9ddc-7a8e2aeda95a queue resource quota insufficient
19s Warning Unschedulable podgroup/minavailable-job-4d40d46d-8bab-4914-9ddc-7a8e2aeda95a 0/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable
with this change
➜ volcano git:(add_uninqueue_state) ✗ # image with add_uninqueue_state-67bf4ad9a
➜ volcano git:(add_uninqueue_state) ✗ kubectl get pod -n volcano-system -l app=volcano-scheduler
NAME READY STATUS RESTARTS AGE
volcano-scheduler-5bc9875dbb-sjvvn 1/1 Running 0 3m35s
➜ volcano git:(add_uninqueue_state) ✗ kubectl get deployments.apps -n volcano-system volcano-scheduler -o yaml |grep "image:"
image: volcanosh/vc-scheduler:add_uninqueue_state-67bf4ad9a
➜ volcano git:(add_uninqueue_state) ✗ kubectl get podgroups.scheduling.volcano.
NAME STATUS MINMEMBER RUNNINGS AGE
minavailable-job-9374cf52-55a0-4fc5-bb2c-effacd5703d8 Pending 3 69s
➜ volcano git:(add_uninqueue_state) ✗ kubectl get events |grep minavailable-job-9374cf52-55a0-4fc5-bb2c-effacd5703d8
73s Normal Uninqueueable podgroup/minavailable-job-9374cf52-55a0-4fc5-bb2c-effacd5703d8 queue resource quota insufficient
74s Warning Unschedulable podgroup/minavailable-job-9374cf52-55a0-4fc5-bb2c-effacd5703d8 0/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable
# yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-08-12T07:29:26Z"
  generation: 3
  name: minavailable-job-9374cf52-55a0-4fc5-bb2c-effacd5703d8
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: minavailable-job
    uid: 9374cf52-55a0-4fc5-bb2c-effacd5703d8
  resourceVersion: "263313"
  uid: 750510bb-48a0-4cab-8d78-4e466c47b928
spec:
  minMember: 3
  minResources:
    count/pods: "3"
    cpu: 1100m
    memory: 200Mi
    pods: "3"
    requests.cpu: 1100m
    requests.memory: 200Mi
  minTaskMember:
    master: 2
    work: 1
  queue: test
status:
  conditions:
  - lastTransitionTime: "2023-08-12T07:30:32Z"
    message: queue resource quota insufficient
    reason: NotInqueueable
    status: "True"
    transitionID: 80bfabf1-df94-46cf-8d62-546080a83ae3
    type: Unschedulable
  phase: Pending
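As a rough sketch of how a client could read the status above, the Go snippet below looks for the active `Unschedulable` condition and returns its reason and message. The `podGroupCondition` type and `pendingReason` are hypothetical helpers for illustration, not the Volcano API types; the field names mirror the YAML keys shown above.

```go
package main

import "fmt"

// podGroupCondition is a simplified, hypothetical mirror of the YAML fields above.
type podGroupCondition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// pendingReason returns the reason and message of the first condition whose
// type is "Unschedulable" and whose status is "True". ok is false if none matches.
func pendingReason(conds []podGroupCondition) (reason, message string, ok bool) {
	for _, c := range conds {
		if c.Type == "Unschedulable" && c.Status == "True" {
			return c.Reason, c.Message, true
		}
	}
	return "", "", false
}

func main() {
	conds := []podGroupCondition{{
		Type:    "Unschedulable",
		Status:  "True",
		Reason:  "NotInqueueable",
		Message: "queue resource quota insufficient",
	}}
	if reason, msg, ok := pendingReason(conds); ok {
		fmt.Printf("pending because %s: %s\n", reason, msg)
	}
}
```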
test result with job can not enqueue
/assign @wangyang0616 @hwdef @william-wang @Thor-wl
Another change is to append the original error to the message, so that both the gang-unschedulable info and the original reason are displayed.
I think the pr is well intended, but I have two suggestions:
- this pr needs documentation
- as far as the current code is concerned, the hints are still too simple; we need hints similar to: there are xx nodes with insufficient cpu, xx nodes with insufficient gpu, xx nodes with unsatisfied memory, xx nodes with unsatisfied affinity
I remember that this information existed in past versions, but some subsequent PRs overwrote that code, and now the information is missing.
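To make the requested hint format concrete, here is a minimal Go sketch that aggregates per-node failure reasons into a single message. It is an assumption about how such hints could be built, not existing Volcano code: `summarizeNodeReasons` and its input shape (node name mapped to a list of reason strings) are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// summarizeNodeReasons counts how many nodes reported each failure reason and
// renders "N node(s) with <reason>" entries, sorted by reason for stable output.
// It assumes each reason appears at most once per node.
func summarizeNodeReasons(nodeReasons map[string][]string) string {
	counts := map[string]int{}
	for _, reasons := range nodeReasons {
		for _, r := range reasons {
			counts[r]++
		}
	}
	keys := make([]string, 0, len(counts))
	for r := range counts {
		keys = append(keys, r)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, r := range keys {
		parts = append(parts, fmt.Sprintf("%d node(s) with %s", counts[r], r))
	}
	return strings.Join(parts, ", ")
}

func main() {
	fmt.Println(summarizeNodeReasons(map[string][]string{
		"node-a": {"insufficient cpu"},
		"node-b": {"insufficient cpu", "unsatisfied affinity"},
	}))
}
```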
the hints are still too simple
This PR only adds the un-enqueueable reason, which does not cover all cases. I know what you want, e.g. issue https://github.com/volcano-sh/volcano/issues/2993. It is better to do that in another PR.
ok, I know, but I still want the docs. Because I do not know why we need this status.
Yes, I will add it later.
Hi @william-wang, can we move this improvement forward by merging https://github.com/volcano-sh/apis/pull/113?
We need to refine the following message to let the user know the enqueue phase and the detailed reason. @Monokaix "message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable, origin reason: queue resource quota insufficient' reason: NotInqueueable"
/assign @Monokaix
@Monokaix Could we release this in v1.9.0?
I think this is important
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
To complete the pull request process, please assign william-wang
You can assign the PR to them by writing /assign @william-wang
in a comment when ready.
We need to refine following message to let user know the enqueue phase and the detail reason. @Monokaix "message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailabl, origin reason: queue resource quota insufficient' reason: NotInqueueable"
@lowang-bh Has this been resolved?
It is already at this level. https://github.com/volcano-sh/volcano/pull/3045#issuecomment-1676195270
/priority important-longterm
@lowang-bh: The label(s) priority/
cannot be applied. These labels are supported: ``