volcano icon indicating copy to clipboard operation
volcano copied to clipboard

put back the queue to priority queue after job's resource allocating …

Open panoswoo opened this issue 1 year ago • 28 comments

Resolves #3407

Put back the queue to priority queue after job's resource allocating finished to ensure that the priority of the queue is calculated based on the latest resource allocation situation.

panoswoo avatar Apr 15 '24 03:04 panoswoo

/retest

panoswoo avatar Apr 15 '24 03:04 panoswoo

@panoswoo: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

volcano-sh-bot avatar Apr 15 '24 03:04 volcano-sh-bot

Hi, please add some ut to cover it. Thanks

lowang-bh avatar Apr 15 '24 03:04 lowang-bh

Hi, please add some ut to cover it. Thanks

Done, but i haven't found a simple way to check the allocate order of tasks, so I made some extensions to TestCommonStruct to enable it to record the actual allocation order. This change is compatible with previous cases. Looking forward to your suggestions

panoswoo avatar Apr 15 '24 09:04 panoswoo

Hi, please add some ut to cover it. Thanks

Done, but i haven't found a simple way to check the allocate order of tasks, so I made some extensions to TestCommonStruct to enable it to record the actual allocation order. This change is compatible with previous cases. Looking forward to your suggestions

Don't need to do that. Just let the whole resource only enough for allocating first few jobs and check the left jobs are pending.

lowang-bh avatar Apr 15 '24 09:04 lowang-bh

Hi, please add some ut to cover it. Thanks

Done, but i haven't found a simple way to check the allocate order of tasks, so I made some extensions to TestCommonStruct to enable it to record the actual allocation order. This change is compatible with previous cases. Looking forward to your suggestions

Don't need to do that. Just let the whole resource only enough for allocating first few jobs and check the left jobs are pending.

I have considered doing this before, but due to the same weight for both queues, when I try to fill up all resources with tasks, they will be rejected by the proportion plugin during Overused and Allocable detection. (because we only have two queues, if we want to fill up all resources, must have one queue's resource allocated is more than half). But I just realized that maybe I can disable the Overused and Allocatable checks directly in the plugin's config.

panoswoo avatar Apr 15 '24 11:04 panoswoo

Don't need to do that. Just let the whole resource only enough for allocating first few jobs and check the left jobs are pending.

Add two cases:

  • total 5cpus, pod-small-1 use 1cpu, then pod-large-2 use 3cpu, and then pod-large-1 pending, indicate that q-2 is allocated first
  • total 5 cpus, pod-small-1 use 1cpu, then pod-large-2 use 2cpu, and then pod-large-1 use 2 cpus, pod-small-2 is pending, indicate that q-2 is first allocated and then q-1 is allocated

lowang-bh avatar Apr 16 '24 02:04 lowang-bh

Wait #3408 fix the failed UT.

lowang-bh avatar Apr 16 '24 02:04 lowang-bh

May be we need to adjust the e2e test case.

lowang-bh avatar Apr 16 '24 05:04 lowang-bh

May be we need to adjust the e2e test case.

let me take a look

panoswoo avatar Apr 16 '24 06:04 panoswoo

@lowang-bh Hi, I seem to have found some issues while searching for the volcano component logs generated by e2e testing.

  1. when e2e test failed, we will backup volcano component logs by https://github.com/volcano-sh/volcano/blob/c414e56be9d238efd4665ba81d06a535f55dc6b9/hack/run-e2e-kind.sh#L69-L74 we are trying to get log from namespace kube-system but those component are installed in volcano-system actually https://github.com/volcano-sh/volcano/blob/c414e56be9d238efd4665ba81d06a535f55dc6b9/hack/run-e2e-kind.sh#L57
  2. it seems that we didn't upload the log file at the end of the workflow https://github.com/volcano-sh/volcano/blob/c414e56be9d238efd4665ba81d06a535f55dc6b9/.github/workflows/e2e_scheduling_basic.yaml#L39-L41

panoswoo avatar Apr 16 '24 07:04 panoswoo

  1. we are trying to get log from namespace kube-system but those component are installed in volcano-system actually

That is a issue, now we use volcano-system. You can file another PR an merge it first. @Monokaix

lowang-bh avatar Apr 16 '24 10:04 lowang-bh

/retest

lowang-bh avatar Apr 19 '24 13:04 lowang-bh

May be we need to adjust the e2e test case.

I triggered the test again and it passed. I haven't made any modifications. Have we recently merged any fixes?

panoswoo avatar Apr 20 '24 13:04 panoswoo

May be we need to adjust the e2e test case.

I triggered the test again and it passed. I haven't made any modifications. Have we recently merged any fixes?

No update. Maybe it is randomly failed. We'd better check it.

lowang-bh avatar Apr 20 '24 13:04 lowang-bh

May be we need to adjust the e2e test case.

I triggered the test again and it passed. I haven't made any modifications. Have we recently merged any fixes?

No update. Maybe it is randomly failed. We'd better check it.

I am unable to reproduce the problem now. I don't have detailed logs of the previous exception cases, and this error is a large process (involving multiple components) that has timed out. I can't determine which part of the problem went wrong without detailed logs :(

panoswoo avatar Apr 20 '24 14:04 panoswoo

@panoswoo please make the ci succeed and so we can make this pr merged.

lowang-bh avatar Apr 27 '24 09:04 lowang-bh

/retest

panoswoo avatar Apr 27 '24 09:04 panoswoo

@panoswoo: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

volcano-sh-bot avatar Apr 27 '24 09:04 volcano-sh-bot

@panoswoo please make the ci succeed and so we can make this pr merged.

I will fix it ASAP

panoswoo avatar Apr 27 '24 09:04 panoswoo

/ok-to-test

lowang-bh avatar Apr 27 '24 11:04 lowang-bh

/retest

lowang-bh avatar Apr 27 '24 11:04 lowang-bh

/assign @Monokaix @hwdef Please also help to review it. Thanks.

lowang-bh avatar Apr 27 '24 11:04 lowang-bh

/retest

panoswoo avatar Apr 27 '24 11:04 panoswoo

HI, @panoswoo , You can close and reopen it to trigger the CI.

lowang-bh avatar May 02 '24 13:05 lowang-bh

/lgtm

lowang-bh avatar May 14 '24 14:05 lowang-bh

Why the former e2e are failed?

Monokaix avatar May 15 '24 01:05 Monokaix

New changes are detected. LGTM label has been removed.

volcano-sh-bot avatar May 21 '24 02:05 volcano-sh-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: To complete the pull request process, please assign lowang-bh You can assign the PR to them by writing /assign @lowang-bh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

volcano-sh-bot avatar May 21 '24 02:05 volcano-sh-bot

Please squash to one commit: )

Monokaix avatar May 21 '24 02:05 Monokaix

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

volcano-sh-bot avatar Jun 14 '24 02:06 volcano-sh-bot

/hold

has case not covered.

lowang-bh avatar Jun 14 '24 02:06 lowang-bh