volcano put back the queue to priority queue after job's resource allocating …

Resolves #3407

Put back the queue to priority queue after job's resource allocating finished to ensure that the priority of the queue is calculated based on the latest resource allocation situation.

Apr 15 '24 03:04 panoswoo

/retest

Apr 15 '24 03:04 panoswoo

@panoswoo: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Apr 15 '24 03:04 volcano-sh-bot

Hi, please add some ut to cover it. Thanks

Apr 15 '24 03:04 lowang-bh

Hi, please add some ut to cover it. Thanks

Done, but i haven't found a simple way to check the allocate order of tasks, so I made some extensions to TestCommonStruct to enable it to record the actual allocation order. This change is compatible with previous cases. Looking forward to your suggestions

Apr 15 '24 09:04 panoswoo

Hi, please add some ut to cover it. Thanks

Done, but i haven't found a simple way to check the allocate order of tasks, so I made some extensions to TestCommonStruct to enable it to record the actual allocation order. This change is compatible with previous cases. Looking forward to your suggestions

Don't need to do that. Just let the whole resource only enough for allocating first few jobs and check the left jobs are pending.

Apr 15 '24 09:04 lowang-bh

Hi, please add some ut to cover it. Thanks

Done, but i haven't found a simple way to check the allocate order of tasks, so I made some extensions to TestCommonStruct to enable it to record the actual allocation order. This change is compatible with previous cases. Looking forward to your suggestions

Don't need to do that. Just let the whole resource only enough for allocating first few jobs and check the left jobs are pending.

I have considered doing this before, but due to the same weight for both queues, when I try to fill up all resources with tasks, they will be rejected by the proportion plugin during Overused and Allocable detection. (because we only have two queues, if we want to fill up all resources, must have one queue's resource allocated is more than half). But I just realized that maybe I can disable the Overused and Allocatable checks directly in the plugin's config.

Apr 15 '24 11:04 panoswoo

Don't need to do that. Just let the whole resource only enough for allocating first few jobs and check the left jobs are pending.

Add two cases:

total 5cpus, pod-small-1 use 1cpu, then pod-large-2 use 3cpu, and then pod-large-1 pending, indicate that q-2 is allocated first
total 5 cpus, pod-small-1 use 1cpu, then pod-large-2 use 2cpu, and then pod-large-1 use 2 cpus, pod-small-2 is pending, indicate that q-2 is first allocated and then q-1 is allocated

Apr 16 '24 02:04 lowang-bh

Wait #3408 fix the failed UT.

Apr 16 '24 02:04 lowang-bh

May be we need to adjust the e2e test case.

Apr 16 '24 05:04 lowang-bh

May be we need to adjust the e2e test case.

let me take a look

Apr 16 '24 06:04 panoswoo

@lowang-bh Hi, I seem to have found some issues while searching for the volcano component logs generated by e2e testing.

when e2e test failed, we will backup volcano component logs by https://github.com/volcano-sh/volcano/blob/c414e56be9d238efd4665ba81d06a535f55dc6b9/hack/run-e2e-kind.sh#L69-L74 we are trying to get log from namespace kube-system but those component are installed in volcano-system actually https://github.com/volcano-sh/volcano/blob/c414e56be9d238efd4665ba81d06a535f55dc6b9/hack/run-e2e-kind.sh#L57
it seems that we didn't upload the log file at the end of the workflow https://github.com/volcano-sh/volcano/blob/c414e56be9d238efd4665ba81d06a535f55dc6b9/.github/workflows/e2e_scheduling_basic.yaml#L39-L41

Apr 16 '24 07:04 panoswoo

we are trying to get log from namespace kube-system but those component are installed in volcano-system actually

That is a issue, now we use volcano-system. You can file another PR an merge it first. @Monokaix

Apr 16 '24 10:04 lowang-bh

/retest

Apr 19 '24 13:04 lowang-bh

May be we need to adjust the e2e test case.

I triggered the test again and it passed. I haven't made any modifications. Have we recently merged any fixes?

Apr 20 '24 13:04 panoswoo

May be we need to adjust the e2e test case.

I triggered the test again and it passed. I haven't made any modifications. Have we recently merged any fixes?

No update. Maybe it is randomly failed. We'd better check it.

Apr 20 '24 13:04 lowang-bh

May be we need to adjust the e2e test case.

I triggered the test again and it passed. I haven't made any modifications. Have we recently merged any fixes?

No update. Maybe it is randomly failed. We'd better check it.

I am unable to reproduce the problem now. I don't have detailed logs of the previous exception cases, and this error is a large process (involving multiple components) that has timed out. I can't determine which part of the problem went wrong without detailed logs :(

Apr 20 '24 14:04 panoswoo

@panoswoo please make the ci succeed and so we can make this pr merged.

Apr 27 '24 09:04 lowang-bh

/retest

Apr 27 '24 09:04 panoswoo

@panoswoo: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Apr 27 '24 09:04 volcano-sh-bot

@panoswoo please make the ci succeed and so we can make this pr merged.

I will fix it ASAP

Apr 27 '24 09:04 panoswoo

/ok-to-test

Apr 27 '24 11:04 lowang-bh

/retest

Apr 27 '24 11:04 lowang-bh

/assign @Monokaix @hwdef Please also help to review it. Thanks.

Apr 27 '24 11:04 lowang-bh

/retest

Apr 27 '24 11:04 panoswoo

HI, @panoswoo , You can close and reopen it to trigger the CI.

May 02 '24 13:05 lowang-bh

/lgtm

May 14 '24 14:05 lowang-bh

Why the former e2e are failed?

May 15 '24 01:05 Monokaix

New changes are detected. LGTM label has been removed.

May 21 '24 02:05 volcano-sh-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: To complete the pull request process, please assign lowang-bh You can assign the PR to them by writing /assign @lowang-bh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/scheduler/OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

May 21 '24 02:05 volcano-sh-bot

Please squash to one commit: )

May 21 '24 02:05 Monokaix

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/scheduler/OWNERS~~ [william-wang]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Jun 14 '24 02:06 volcano-sh-bot

/hold

has case not covered.

Jun 14 '24 02:06 lowang-bh

volcano volcano copied to clipboard

put back the queue to priority queue after job's resource allocating …

volcano
volcano copied to clipboard