
Add uninqueueable reason in podgroup condition

Open lowang-bh opened this issue 1 year ago • 22 comments

Please merge the API PR https://github.com/volcano-sh/apis/pull/113 first; after that I need to update go.mod and refresh the last commit.

Add an un-inqueueable reason to the podgroup condition when a job is rejected at enqueue, so that describing the podgroup shows more clearly why the job is pending.

This PR changes the podgroup's pending condition, when it is caused by insufficient queue quota, from:

    message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable'
    reason: NotEnoughResources

to

    message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable, origin reason: queue resource quota insufficient'
    reason: NotInqueueable

test result

origin

spec:
  minMember: 3
  minResources:
    count/pods: "3"
    cpu: 1100m
    memory: 200Mi
    pods: "3"
    requests.cpu: 1100m
    requests.memory: 200Mi
  minTaskMember:
    master: 2
    work: 1
  queue: test
status:
  conditions:
  - lastTransitionTime: "2023-08-12T06:27:12Z"
    message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: 35ec245e-f901-47a9-a2b1-ef0456505f86
    type: Unschedulable
  phase: Pending
➜  volcano git:(add_uninqueue_state) ✗ kubectl get events |grep minavailable-job-4d40d46d-8bab-4914-9ddc-7a8e2aeda95a
18s         Normal    Unschedulable             podgroup/minavailable-job-4d40d46d-8bab-4914-9ddc-7a8e2aeda95a   queue resource quota insufficient
19s         Warning   Unschedulable             podgroup/minavailable-job-4d40d46d-8bab-4914-9ddc-7a8e2aeda95a   0/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable

with this change

➜  volcano git:(add_uninqueue_state) ✗ # image with add_uninqueue_state-67bf4ad9a
➜  volcano git:(add_uninqueue_state) ✗ kubectl get pod -n volcano-system -l app=volcano-scheduler
NAME                                 READY   STATUS    RESTARTS   AGE
volcano-scheduler-5bc9875dbb-sjvvn   1/1     Running   0          3m35s
➜  volcano git:(add_uninqueue_state) ✗ kubectl get deployments.apps -n volcano-system volcano-scheduler -o yaml |grep "image:"
        image: volcanosh/vc-scheduler:add_uninqueue_state-67bf4ad9a
➜  volcano git:(add_uninqueue_state) ✗ kubectl get podgroups.scheduling.volcano.
NAME                                                    STATUS    MINMEMBER   RUNNINGS   AGE
minavailable-job-9374cf52-55a0-4fc5-bb2c-effacd5703d8   Pending   3                      69s
➜  volcano git:(add_uninqueue_state) ✗ kubectl get events |grep minavailable-job-9374cf52-55a0-4fc5-bb2c-effacd5703d8
73s         Normal    Uninqueueable     podgroup/minavailable-job-9374cf52-55a0-4fc5-bb2c-effacd5703d8   queue resource quota insufficient
74s         Warning   Unschedulable     podgroup/minavailable-job-9374cf52-55a0-4fc5-bb2c-effacd5703d8   0/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable


# yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-08-12T07:29:26Z"
  generation: 3
  name: minavailable-job-9374cf52-55a0-4fc5-bb2c-effacd5703d8
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: minavailable-job
    uid: 9374cf52-55a0-4fc5-bb2c-effacd5703d8
  resourceVersion: "263313"
  uid: 750510bb-48a0-4cab-8d78-4e466c47b928
spec:
  minMember: 3
  minResources:
    count/pods: "3"
    cpu: 1100m
    memory: 200Mi
    pods: "3"
    requests.cpu: 1100m
    requests.memory: 200Mi
  minTaskMember:
    master: 2
    work: 1
  queue: test
status:
  conditions:
  - lastTransitionTime: "2023-08-12T07:30:32Z"
    message: queue resource quota insufficient
    reason: NotInqueueable
    status: "True"
    transitionID: 80bfabf1-df94-46cf-8d62-546080a83ae3
    type: Unschedulable
  phase: Pending

lowang-bh avatar Aug 12 '23 07:08 lowang-bh

Test result with a job that cannot be enqueued: [image]

lowang-bh avatar Aug 12 '23 07:08 lowang-bh

/assign @wangyang0616 @hwdef @william-wang @Thor-wl

lowang-bh avatar Aug 12 '23 07:08 lowang-bh

Another change is to append the original error to the message, so that both the gang-unschedulable info and the original reason are displayed.


lowang-bh avatar Aug 13 '23 02:08 lowang-bh

I think the PR is well intended, but I have two suggestions:

  1. This PR needs documentation.
  2. As far as the current code is concerned, the hints are still too simple. We need hints similar to: there are xx nodes with insufficient cpu, xx nodes with insufficient gpu, xx nodes with unsatisfied memory, xx nodes with unsatisfied affinity
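A hint of the kind described in point 2 could be produced by counting failure reasons across nodes. The sketch below is only an illustration under assumptions: `summarizeNodeFailures` and its input shape (node name mapped to a list of reason strings) are hypothetical and are not part of the volcano scheduler.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// summarizeNodeFailures (hypothetical helper) counts how many nodes failed
// for each reason and renders a single hint line. It assumes each reason
// appears at most once per node.
func summarizeNodeFailures(failures map[string][]string) string {
	counts := map[string]int{}
	for _, reasons := range failures {
		for _, r := range reasons {
			counts[r]++
		}
	}

	// Sort the reasons so the output is deterministic.
	reasons := make([]string, 0, len(counts))
	for r := range counts {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons)

	parts := make([]string, 0, len(reasons))
	for _, r := range reasons {
		parts = append(parts, fmt.Sprintf("%d nodes with %s", counts[r], r))
	}
	return strings.Join(parts, ", ")
}

func main() {
	// Example data; node names and reasons are made up for illustration.
	failures := map[string][]string{
		"node-1": {"insufficient cpu"},
		"node-2": {"insufficient cpu", "unsatisfied affinity"},
	}
	fmt.Println(summarizeNodeFailures(failures))
}
```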

hwdef avatar Aug 13 '23 04:08 hwdef

there are xx nodes with insufficient cpu, xx nodes with insufficient gpu, xx nodes with unsatisfied memory, xx nodes with unsatisfied affinity

I remember that information existed in a past version, but some subsequent PRs overwrote that code, and now the information is missing.

the hints are still too simple

This PR only adds the un-enqueueable reason, which does not cover all cases. I know what you want, e.g. issue https://github.com/volcano-sh/volcano/issues/2993. It is better to do that in another PR.

lowang-bh avatar Aug 13 '23 09:08 lowang-bh

This PR only adds the un-enqueueable reason, which does not cover all cases. I know what you want, e.g. issue https://github.com/volcano-sh/volcano/issues/2993. It is better to do that in another PR.

ok, I know, but I still want the docs. Because I do not know why we need this status.

hwdef avatar Aug 13 '23 13:08 hwdef

ok, I know, but I still want the docs. Because I do not know why we need this status.

Yes, I will add it later.

lowang-bh avatar Aug 14 '23 00:08 lowang-bh

Hi @william-wang, can we push this improvement ahead by merging https://github.com/volcano-sh/apis/pull/113?

lowang-bh avatar Nov 17 '23 03:11 lowang-bh

We need to refine the following message to let users know the enqueue phase and the detailed reason. @Monokaix "message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable, origin reason: queue resource quota insufficient' reason: NotInqueueable"

william-wang avatar Dec 14 '23 12:12 william-wang

/assign @Monokaix

lowang-bh avatar Jan 25 '24 07:01 lowang-bh

@Monokaix Could we release this in v1.9.0?

lowang-bh avatar Mar 08 '24 02:03 lowang-bh

I think this is important

hwdef avatar Mar 27 '24 02:03 hwdef

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign william-wang.
You can assign the PR to them by writing /assign @william-wang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

volcano-sh-bot avatar Jul 26 '24 13:07 volcano-sh-bot

We need to refine the following message to let users know the enqueue phase and the detailed reason. @Monokaix "message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable, origin reason: queue resource quota insufficient' reason: NotInqueueable"

@lowang-bh Has this been resolved?

hwdef avatar Aug 13 '24 03:08 hwdef

We need to refine the following message to let users know the enqueue phase and the detailed reason. @Monokaix "message: '3/0 tasks in gang unschedulable: pod group is not ready, 3 minAvailable, origin reason: queue resource quota insufficient' reason: NotInqueueable"

@lowang-bh Has this been resolved?

It is already at this level. https://github.com/volcano-sh/volcano/pull/3045#issuecomment-1676195270

lowang-bh avatar Aug 13 '24 08:08 lowang-bh

/priority important-longterm

lowang-bh avatar Aug 13 '24 08:08 lowang-bh

@lowang-bh: The label(s) priority/ cannot be applied. These labels are supported: ``

In response to this:

/priority important-longterm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

volcano-sh-bot avatar Aug 13 '24 08:08 volcano-sh-bot