WIP: DRA for 1.31
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is an implementation of the revised API and features from https://github.com/kubernetes/enhancements/pull/4709.
Which issue(s) this PR fixes:
Related-to:
- https://github.com/kubernetes/enhancements/issues/4381
- https://github.com/kubernetes/enhancements/issues/3063
Fixes: https://github.com/kubernetes/kubernetes/issues/125665, https://github.com/kubernetes/kubernetes/issues/124041
Special notes for your reviewer:
Several of these changes were already proposed earlier in separate PRs and/or might get split out. Here's a list:
- [x] https://github.com/kubernetes/kubernetes/pull/124931
- [x] https://github.com/kubernetes/kubernetes/pull/124595
- [x] https://github.com/kubernetes/kubernetes/pull/124548
- [ ] https://github.com/kubernetes/kubernetes/pull/125163
- [x] https://github.com/kubernetes/kubernetes/pull/125116
- [x] https://github.com/kubernetes/kubernetes/pull/125495
- [x] https://github.com/kubernetes/kubernetes/pull/125698
- [x] https://github.com/kubernetes/kubernetes/pull/125699
In this PR, please only review commits starting with "DRA: remove immediate allocation".
Does this PR introduce a user-facing change?
DRA: new API and several new features
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
- [KEP]: https://github.com/kubernetes/enhancements/issues/4381
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Before you push a new change, it would be great if you could either rebut or ACK+resolve comments that are addressed.
I'll definitely ACK changes that I have made or will make (I'm currently waiting for further comments before pushing the next update).
I was less sure about resolving comments myself. Don't you want to double-check how I resolved each comment?
@klueska: GitHub didn't allow me to request PR reviews from the following users: for, more.
Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
In response to this:
My understanding is -- once this is done -- an external client would read the list of CDIDevices from the PodResources API and do something with it. More specifically, a CNI plugin would do this.
/cc @moshe010 for more info
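For context, here is a hypothetical sketch (not code from this PR) of what such an external client could look like. It assumes the `k8s.io/kubelet/pkg/apis/podresources/v1` API with its `DynamicResources`/`CDIDevices` fields and the default kubelet socket path; exact field names should be double-checked against the targeted release.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	v1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// The kubelet serves the PodResources API on a local unix socket
	// (path assumed here; it is configurable).
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := v1.NewPodResourcesListerClient(conn)
	resp, err := client.List(context.Background(), &v1.ListPodResourcesRequest{})
	if err != nil {
		log.Fatal(err)
	}

	// Walk all pods and containers and print the CDI devices that DRA
	// drivers prepared for each resource claim.
	for _, pod := range resp.PodResources {
		for _, container := range pod.Containers {
			for _, dyn := range container.DynamicResources {
				for _, claimRes := range dyn.ClaimResources {
					for _, dev := range claimRes.CDIDevices {
						fmt.Printf("%s/%s container %s claim %s: CDI device %s\n",
							pod.Namespace, pod.Name, container.Name, dyn.ClaimName, dev.Name)
					}
				}
			}
		}
	}
}
```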
/test pull-kubernetes-integration pull-kubernetes-verify pull-kubernetes-verify-lint pull-kubernetes-unit pull-kubernetes-node-e2e-containerd-1-7-dra pull-kubernetes-kind-dra
I am feeling lucky... :grin:
However, I already ran some tests locally. There is one known, odd (random?) failure:
$ go test ./pkg/api/testing
--- FAIL: TestDefaulting (1.76s)
--- FAIL: TestDefaulting/resource.k8s.io/v1alpha3,_Kind=ResourceClaimList (0.01s)
defaulting_test.go:238: expected resource.k8s.io/v1alpha3, Kind=ResourceClaimList to trigger defaulting due to fuzzing
FAIL
FAIL k8s.io/kubernetes/pkg/api/testing 17.294s
FAIL
$ go test -run=TestDefaulting/resource.k8s.io/v1alpha3,_Kind=ResourceClaimList ./pkg/api/testing
ok k8s.io/kubernetes/pkg/api/testing 0.062s
/test pull-kubernetes-integration pull-kubernetes-verify pull-kubernetes-verify-lint pull-kubernetes-unit pull-kubernetes-kind-dra
The E2E node jobs (pull-kubernetes-node-e2e-containerd-1-7-dra) are hard-coded to enable v1alpha2, so they are not usable here. We either need two versions of them or have to update them after merging.
/test pull-kubernetes-integration
This PR may require API review.
If so, when the changes are ready, complete the pre-review checklist and request an API review.
Status of requested reviews is tracked in the API Review project.
I addressed review feedback and linter hints. I gave up my stance that wrapping of errors (%v vs. %w) should be considered on a case-by-case basis and instead now always wrap.
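For illustration (not code from this PR), the practical difference: `%w` keeps the cause matchable via `errors.Is`/`errors.As`, while `%v` flattens it into plain text.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

func main() {
	_, err := os.Open("/does/not/exist")

	wrapped := fmt.Errorf("opening config: %w", err) // preserves the cause
	printed := fmt.Errorf("opening config: %v", err) // reduces it to text

	fmt.Println(errors.Is(wrapped, os.ErrNotExist)) // true
	fmt.Println(errors.Is(printed, os.ErrNotExist)) // false
}
```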
Here's the diff before I rebase because of a conflict: https://github.com/kubernetes/kubernetes/compare/9a7aad84208bb330935ce376b37f2e47cb8e2565..18417425dc31639b92c9d35343bb18b29c6a7f53
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra
Job updated, might work again.
Do we think we want per-slice attribs in 1.31?
I would say "yes", because it will usefully compact things, and can be ignored by the driver author if they don't need it. Should we do Capacity too?
This will cause one problem, though. Since Name got bumped up out of BasicDevice, all that is left in BasicDevice are attributes and capacities. If a device doesn't have any per-device attribute or capacity, only a name, then BasicDevice will be empty. Maybe we just don't use omitempty and we are OK? Or I guess we could force driver authors to put some attribute in BasicDevice, even if it is redundant?
Empty structs are okay. We have a few places where those are possible, and as far as I know all of the encodings and all of the clients that we have seen are okay with it.
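A minimal sketch of why that works, with illustrative types (not the actual `resource.k8s.io/v1alpha3` Go definitions): a non-nil pointer to an empty struct still encodes as `{}`, even when every field inside it carries `omitempty`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative shapes mirroring the discussion above, not the real API types.
type BasicDevice struct {
	Attributes map[string]string `json:"attributes,omitempty"`
	Capacity   map[string]string `json:"capacity,omitempty"`
}

type Device struct {
	Name  string       `json:"name"`
	Basic *BasicDevice `json:"basic,omitempty"` // omitempty only skips a nil pointer
}

func main() {
	// A device with a name but no per-device attributes or capacities:
	// its BasicDevice is empty and serializes as {}.
	d := Device{Name: "gpu-0", Basic: &BasicDevice{}}
	out, _ := json.Marshal(d)
	fmt.Println(string(out)) // {"name":"gpu-0","basic":{}}
}
```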
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra
PR updates:
- review feedback, mostly around validation, and some test cleanups: https://github.com/kubernetes/kubernetes/compare/d7d817951bcca3ee8ae534a18ab9a578cd05e93d..ade5cd5e0b61e363656d462d092dd6e21d48d5ec (if the link becomes stale, this is the diff between dra-1.31-2024-07-15-II and dra-1.31-2024-07-16-I in my repo)
- rebased because of a conflict
- added DRAControlPlaneController as another commit (touches several different areas, could get squashed if that's preferred)
We also now have "pull-kubernetes-kind-dra" for testing with only structured parameters (= what might go beta in 1.32) and "pull-kubernetes-kind-classic-dra" (adds that new DRAControlPlaneController feature gate).
I'm aware that those E2E tests have issues in the CI that I cannot reproduce locally. Might be timing issues. I keep looking...
Another force-push of squashed changes: https://github.com/kubernetes/kubernetes/compare/83c8688f9da030edf5cc10135def0ec70c2056f2..f18fc36783a298d4ef6c07a00f2c93bb5297b577 (dra-1.31-2024-07-16-III -> dra-1.31-2024-07-17-I).
Contains https://github.com/kubernetes/kubernetes/pull/126156 as base and other updates in response to review feedback.
I have to rebase before tests can be run again.
As far as I am concerned, the API is LGTM.
Rebased after https://github.com/kubernetes/kubernetes/pull/125163 and https://github.com/kubernetes/kubernetes/pull/126156 were merged. Only 14 commits left :smile:
Changing timeouts in https://github.com/kubernetes/kubernetes/pull/125488/commits/13c74c9db277359fc346a4bc2773addded17d59d may have helped stabilize the E2E node tests. Let's try again...
/test pull-kubernetes-node-e2e-containerd-1-7-dra pull-kubernetes-node-e2e-crio-cgrpv1-dra pull-kubernetes-node-e2e-crio-cgrpv2-dra
Hi all, I will be AFK the last day before code freeze, so I am approving and holding now before I forget. Unless something has changed API-wise, I am good with it.
/approve
/hold
Left some minor comments on the kubelet side of things. Besides those, LGTM from the kubelet side.
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra
Setup failed with what looks like a flake: "Error while dialing: dial unix /var/run/crio/crio.sock: connect: no such file or directory".
From https://github.com/kubernetes/kubernetes/pull/125488#pullrequestreview-2192093009:
> However, scheduler_perf is not included in a PR's presubmit-CI
It is enabled, and there is one test scenario ("SchedulingWithResourceClaimTemplateStructured") for the new API. More can and will be added once we start investigating performance a bit more. Right now, it's more about functionality.
@klueska: would you do us the honor and LGTM together with lifting the hold?
We have positive reviews and approval from (not collecting links, but they are there...):
- @thockin (API)
- @johnbelamaric (scheduling logic, diverse aspects)
- @Huang-Wei, @kerthcet (SIG Scheduling)
- @liggitt (SIG Auth, api-machinery)
- @klueska (SIG Node)
- and several others
Thanks everyone!
Gladly. Thanks everyone! One step closer...
/approve
/lgtm
/unhold
LGTM label has been added.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Huang-Wei, klueska, pohly, thockin
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~api/OWNERS~~ [thockin]
- ~~cmd/kube-controller-manager/OWNERS~~ [thockin]
- ~~hack/OWNERS~~ [pohly,thockin]
- ~~pkg/api/OWNERS~~ [thockin]
- ~~pkg/apis/OWNERS~~ [thockin]
- ~~pkg/controller/resourceclaim/OWNERS~~ [klueska,pohly,thockin]
- ~~pkg/controlplane/OWNERS~~ [thockin]
- ~~pkg/features/OWNERS~~ [thockin]
- ~~pkg/generated/openapi/OWNERS~~ [thockin]
- ~~pkg/kubeapiserver/authorizer/OWNERS~~ [thockin]
- ~~pkg/kubectl/OWNERS~~ [thockin]
- ~~pkg/kubelet/OWNERS~~ [klueska,thockin]
- ~~pkg/printers/OWNERS~~ [thockin]
- ~~pkg/registry/OWNERS~~ [thockin]
- ~~pkg/scheduler/OWNERS~~ [Huang-Wei,thockin]
- ~~plugin/pkg/admission/noderestriction/OWNERS~~ [thockin]
- ~~plugin/pkg/auth/authorizer/OWNERS~~ [thockin]
- ~~staging/publishing/OWNERS~~ [thockin]
- ~~staging/src/k8s.io/api/OWNERS~~ [thockin]
- ~~staging/src/k8s.io/apiserver/pkg/cel/OWNERS~~ [thockin]
- ~~staging/src/k8s.io/cli-runtime/OWNERS~~ [thockin]
- ~~staging/src/k8s.io/client-go/OWNERS~~ [thockin]
- ~~staging/src/k8s.io/dynamic-resource-allocation/OWNERS~~ [klueska,pohly,thockin]
- ~~staging/src/k8s.io/kubelet/OWNERS~~ [klueska,thockin]
- ~~test/OWNERS~~ [pohly,thockin]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment