feat: add hierarchical queues for capacity plugin
ref: #3590
/assign @Monokaix @lowang-bh @shinytang6 Please take a look 👀
Is there validation somewhere that jobs can only be submitted to leaf queues?
Please add some E2E tests for this function, since it is very important.
> Please add some E2E tests for this function, since it is very important.
I will add it later
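For reference, a first cut could look roughly like the sketch below (ginkgo-style, in line with the kind of e2e suite Volcano uses; the `submitJobToQueue` helper, the queue names, and the assumed `vcClient` setup are hypothetical, not code from this PR):

```go
package e2e

import (
	"context"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	batchv1alpha1 "volcano.sh/apis/pkg/apis/batch/v1alpha1"
	vcclient "volcano.sh/apis/pkg/client/clientset/versioned"
)

// vcClient is assumed to be initialized in the suite setup.
var vcClient vcclient.Interface

// submitJobToQueue is a hypothetical helper: it creates a minimal vcjob
// targeting the given queue and returns the API error, if any.
func submitJobToQueue(queue string) error {
	job := &batchv1alpha1.Job{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "hq-e2e-", Namespace: "default"},
		Spec:       batchv1alpha1.JobSpec{Queue: queue},
	}
	_, err := vcClient.BatchV1alpha1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{})
	return err
}

var _ = Describe("Hierarchical queue job admission", func() {
	It("rejects a job submitted to a non-leaf queue", func() {
		// "root.parent" is assumed to already exist and to have a child queue,
		// so the job validating webhook should deny this request.
		Expect(submitJobToQueue("root.parent")).To(HaveOccurred())
	})
})
```

A real test would also need setup and teardown for the parent and child queues.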
> Is there validation somewhere that jobs can only be submitted to leaf queues?
I have already added this logic to the job's validating webhook, so jobs can only be submitted to leaf queues.
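Roughly, the check is along these lines (a minimal sketch only; the `childQueues` lookup is a hypothetical placeholder, not the actual webhook code in this PR):

```go
package validate

import "fmt"

// validateJobQueue rejects jobs whose target queue is not a leaf queue.
// childQueues is a hypothetical lookup returning the names of queues whose
// parent is queueName (e.g. backed by a queue lister).
func validateJobQueue(queueName string, childQueues func(string) ([]string, error)) error {
	children, err := childQueues(queueName)
	if err != nil {
		return fmt.Errorf("failed to list child queues of %q: %v", queueName, err)
	}
	if len(children) > 0 {
		return fmt.Errorf("queue %q is not a leaf queue; jobs may only be submitted to leaf queues", queueName)
	}
	return nil
}
```

In the webhook itself, a non-nil error is translated into a denied admission response.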
/cc
Please consider this scenario:
- Create a cluster with the default config (./hack/local-up-volcano.sh)
- This configuration uses the proportion plugin by default and automatically creates a default queue
- Switch to the capacity plugin and enable hierarchical queues
I believe this is how most existing users will start using hierarchical queues.
Result:
- The root queue is not created.
- The scheduler panics:
2024/10/03 14:13:01 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
I1003 14:13:01.520160 1 flags.go:57] FLAG: --add-dir-header="false"
I1003 14:13:01.520174 1 flags.go:57] FLAG: --alsologtostderr="false"
I1003 14:13:01.520176 1 flags.go:57] FLAG: --ca-cert-file=""
I1003 14:13:01.520179 1 flags.go:57] FLAG: --cache-dump-dir="/tmp"
I1003 14:13:01.520181 1 flags.go:57] FLAG: --cache-dumper="true"
I1003 14:13:01.520183 1 flags.go:57] FLAG: --csi-storage="false"
I1003 14:13:01.520185 1 flags.go:57] FLAG: --default-queue="default"
I1003 14:13:01.520187 1 flags.go:57] FLAG: --enable-healthz="true"
I1003 14:13:01.520189 1 flags.go:57] FLAG: --enable-metrics="true"
I1003 14:13:01.520192 1 flags.go:57] FLAG: --feature-gates=""
I1003 14:13:01.520196 1 flags.go:57] FLAG: --healthz-address=":11251"
I1003 14:13:01.520199 1 flags.go:57] FLAG: --ignored-provisioners="[]"
I1003 14:13:01.520210 1 flags.go:57] FLAG: --kube-api-burst="2000"
I1003 14:13:01.520214 1 flags.go:57] FLAG: --kube-api-qps="2000"
I1003 14:13:01.520218 1 flags.go:57] FLAG: --kubeconfig=""
I1003 14:13:01.520221 1 flags.go:57] FLAG: --leader-elect="false"
I1003 14:13:01.520223 1 flags.go:57] FLAG: --leader-elect-lease-duration="15s"
I1003 14:13:01.520229 1 flags.go:57] FLAG: --leader-elect-renew-deadline="10s"
I1003 14:13:01.520232 1 flags.go:57] FLAG: --leader-elect-resource-lock="leases"
I1003 14:13:01.520234 1 flags.go:57] FLAG: --leader-elect-resource-name="volcano"
I1003 14:13:01.520237 1 flags.go:57] FLAG: --leader-elect-resource-namespace="volcano-system"
I1003 14:13:01.520239 1 flags.go:57] FLAG: --leader-elect-retry-period="2s"
I1003 14:13:01.520242 1 flags.go:57] FLAG: --listen-address=":8080"
I1003 14:13:01.520244 1 flags.go:57] FLAG: --lock-object-namespace="volcano-system"
I1003 14:13:01.520247 1 flags.go:57] FLAG: --log-backtrace-at=":0"
I1003 14:13:01.520251 1 flags.go:57] FLAG: --log-dir=""
I1003 14:13:01.520255 1 flags.go:57] FLAG: --log-file=""
I1003 14:13:01.520259 1 flags.go:57] FLAG: --log-file-max-size="1800"
I1003 14:13:01.520263 1 flags.go:57] FLAG: --log-flush-frequency="5s"
I1003 14:13:01.520267 1 flags.go:57] FLAG: --logtostderr="true"
I1003 14:13:01.520270 1 flags.go:57] FLAG: --master=""
I1003 14:13:01.520274 1 flags.go:57] FLAG: --minimum-feasible-nodes="100"
I1003 14:13:01.520284 1 flags.go:57] FLAG: --minimum-percentage-nodes-to-find="5"
I1003 14:13:01.520287 1 flags.go:57] FLAG: --node-selector="[]"
I1003 14:13:01.520295 1 flags.go:57] FLAG: --node-worker-threads="20"
I1003 14:13:01.520303 1 flags.go:57] FLAG: --one-output="false"
I1003 14:13:01.520306 1 flags.go:57] FLAG: --percentage-nodes-to-find="0"
I1003 14:13:01.520309 1 flags.go:57] FLAG: --plugins-dir=""
I1003 14:13:01.520312 1 flags.go:57] FLAG: --priority-class="true"
I1003 14:13:01.520315 1 flags.go:57] FLAG: --schedule-period="1s"
I1003 14:13:01.520319 1 flags.go:57] FLAG: --scheduler-conf="/volcano.scheduler/volcano-scheduler.conf"
I1003 14:13:01.520322 1 flags.go:57] FLAG: --scheduler-name="[volcano]"
I1003 14:13:01.520328 1 flags.go:57] FLAG: --skip-headers="false"
I1003 14:13:01.520331 1 flags.go:57] FLAG: --skip-log-headers="false"
I1003 14:13:01.520333 1 flags.go:57] FLAG: --stderrthreshold="2"
I1003 14:13:01.520338 1 flags.go:57] FLAG: --tls-cert-file=""
I1003 14:13:01.520340 1 flags.go:57] FLAG: --tls-private-key-file=""
I1003 14:13:01.520343 1 flags.go:57] FLAG: --v="3"
I1003 14:13:01.520348 1 flags.go:57] FLAG: --version="false"
I1003 14:13:01.520350 1 flags.go:57] FLAG: --vmodule=""
W1003 14:13:01.520367 1 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
panic: failed init default queue, with err: admission webhook "mutatequeue.volcano.sh" denied the request: failed to get parent queue of open queue default: queues.scheduling.volcano.sh "root" not found
goroutine 1 [running]:
volcano.sh/volcano/pkg/scheduler/cache.newDefaultQueue({0x25de198, 0xc0000e3620}, {0x22a7c80, 0x7})
/go/src/volcano.sh/volcano/pkg/scheduler/cache/cache.go:513 +0x1d3
volcano.sh/volcano/pkg/scheduler/cache.newSchedulerCache(0xc00017d688, {0xc00005aad0, 0x1, 0x1}, {0x22a7c80, 0x7}, {0x0, 0x0, 0x0}, 0x14, ...)
/go/src/volcano.sh/volcano/pkg/scheduler/cache/cache.go:532 +0xfd
volcano.sh/volcano/pkg/scheduler/cache.New(...)
/go/src/volcano.sh/volcano/pkg/scheduler/cache/cache.go:92
volcano.sh/volcano/pkg/scheduler.NewScheduler(0xc00017d688?, 0xc0000ec000)
/go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:70 +0xeb
volcano.sh/volcano/cmd/scheduler/app.Run(0xc0000ec000)
/go/src/volcano.sh/volcano/cmd/scheduler/app/server.go:71 +0x1a5
main.main()
/go/src/volcano.sh/volcano/cmd/scheduler/main.go:86 +0x325
I got some errors when creating a vcjob:
Failed to create root queue, error: queues.scheduling.volcano.sh "root" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
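That error message usually means an Update call was issued with a freshly built object (empty resourceVersion) instead of creating the object first. A create-if-missing sketch for the root queue, assuming the standard generated Volcano clientset, would look like this (illustration only, not the actual fix):

```go
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	schedulingv1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
	vcclient "volcano.sh/apis/pkg/client/clientset/versioned"
)

// ensureRootQueue creates the root queue only if it does not exist yet.
// Passing a freshly built object (resourceVersion "") to Update produces
// exactly the "must be specified for an update" error above, so only objects
// fetched from the API server should be handed to Update.
func ensureRootQueue(ctx context.Context, client vcclient.Interface) error {
	root := &schedulingv1beta1.Queue{
		ObjectMeta: metav1.ObjectMeta{Name: "root"},
	}
	_, err := client.SchedulingV1beta1().Queues().Create(ctx, root, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil // already created; nothing to do
	}
	if err != nil {
		return fmt.Errorf("failed to create root queue: %v", err)
	}
	return nil
}
```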
/ok-to-test
In the scenario of scheduling directly with a PodGroup (without a Volcano Job), the logic that prevents submitting to a parent queue is missing.
Maybe we need a webhooks.admission.podgroups.validate webhook to check this?
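Something like the following sketch could work (the `hasChildQueue` predicate is a hypothetical placeholder, e.g. backed by a queue lister; this is not existing code):

```go
package validate

import (
	"fmt"

	schedulingv1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// validatePodGroupQueue is a sketch of the check a podgroup validating webhook
// could perform: deny the podgroup if its target queue still has children.
func validatePodGroupQueue(pg *schedulingv1beta1.PodGroup, hasChildQueue func(string) (bool, error)) error {
	hasChild, err := hasChildQueue(pg.Spec.Queue)
	if err != nil {
		return fmt.Errorf("failed to check child queues of %q: %v", pg.Spec.Queue, err)
	}
	if hasChild {
		return fmt.Errorf("queue %q is not a leaf queue; podgroups may only be submitted to leaf queues", pg.Spec.Queue)
	}
	return nil
}
```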
> In the scenario of scheduling directly with a PodGroup (without a Volcano Job), the logic that prevents submitting to a parent queue is missing. Maybe we need a webhooks.admission.podgroups.validate webhook to check this?
Thank you for your suggestion. I will add the missing logic.
Looks good to me. @Monokaix @googs1025 @TaiPark @lowang-bh Please review this again. Can we merge this PR?
@Rui-Gan please squash your commits into one and rebase on the master branch.
> In the scenario of scheduling directly with a PodGroup (without a Volcano Job), the logic that prevents submitting to a parent queue is missing. Maybe we need a webhooks.admission.podgroups.validate webhook to check this?
>
> Thank you for your suggestion. I will add the missing logic.

Is this proposal still under consideration? I can't seem to find the related logic.
> In the scenario of scheduling directly with a PodGroup (without a Volcano Job), the logic that prevents submitting to a parent queue is missing. Maybe we need a webhooks.admission.podgroups.validate webhook to check this?
>
> Thank you for your suggestion. I will add the missing logic.
>
> Is this proposal still under consideration? I can't seem to find the related logic.
Regarding this issue, I think it would be better to submit another PR to handle this after the current PR is merged. There are a couple of reasons for this:
- The use case of directly using podgroup for scheduling is relatively rare, which is why there was no related validate webhook before. For example, when creating or updating the queue field of a podgroup, there is no validation to check if the queue exists. The hdrf plugin also does not perform any validation related to hierarchical queue enablement.
- Currently, adding this validation logic involves a significant amount of code changes, and the e2e tests cannot pass successfully. This is because the deployment yaml files need to be modified, and when running e2e tests on GitHub, the old yaml files are used. Therefore, if we directly include this in the current PR, the e2e tests will not pass.
Based on the above considerations, I believe this logic can be implemented in a separate PR after the current PR is merged. What do you think? Is this a particularly urgent matter?
> Based on the above considerations, I believe this logic can be implemented in a separate PR after the current PR is merged. What do you think? Is this a particularly urgent matter?
If the implementation is quite complex, it would be acceptable to commit a separate PR for it.
/lgtm
I don't have any more questions; this looks good to me.
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Monokaix
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [Monokaix]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment