kubernetes icon indicating copy to clipboard operation
kubernetes copied to clipboard

WIP: DRA for 1.31

Open pohly opened this issue 1 year ago • 5 comments

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is an implementation of the revised API and features from https://github.com/kubernetes/enhancements/pull/4709.

Which issue(s) this PR fixes:

Related-to:

  • https://github.com/kubernetes/enhancements/issues/4381
  • https://github.com/kubernetes/enhancements/issues/3063

Fixes: https://github.com/kubernetes/kubernetes/issues/125665, https://github.com/kubernetes/kubernetes/issues/124041

Special notes for your reviewer:

Several of these changes where already proposed earlier in separate PRs and/or might get split out. Here's a list:

  • [x] https://github.com/kubernetes/kubernetes/pull/124931
  • [x] https://github.com/kubernetes/kubernetes/pull/124595
  • [x] https://github.com/kubernetes/kubernetes/pull/124548
  • [ ] https://github.com/kubernetes/kubernetes/pull/125163
  • [x] https://github.com/kubernetes/kubernetes/pull/125116
  • [x] https://github.com/kubernetes/kubernetes/pull/125495
  • [x] https://github.com/kubernetes/kubernetes/pull/125698
  • [x] https://github.com/kubernetes/kubernetes/pull/125699

In this PR, please only review commits starting with "DRA: remove immediate allocation".

Does this PR introduce a user-facing change?

DRA: new API and several new features

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]:  https://github.com/kubernetes/enhancements/issues/4381

pohly avatar Jun 13 '24 13:06 pohly

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

k8s-ci-robot avatar Jun 13 '24 13:06 k8s-ci-robot

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 13 '24 13:06 k8s-ci-robot

Before you push a new change, It would be great if you could either rebut or ACK+resolve comments that are addressed.

I'll definitely ACK changes that I have made or will make (I'm currently waiting for further comments before pushing the next update).

I was less sure about resolving myself. So you don't want to double-check how I resolved the comment?

pohly avatar Jun 30 '24 09:06 pohly

@klueska: GitHub didn't allow me to request PR reviews from the following users: for, more.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

My understanding is -- once this is done -- an external client would read the list of CDIDevices from the PodResources API and do something with it. More specifically, a CNI plugin would do this.

/cc @moshe010 for more info

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 04 '24 11:07 k8s-ci-robot

/test pull-kubernetes-integration pull-kubernetes-verify pull-kubernetes-verify-lint pull-kubernetes-unit pull-kubernetes-node-e2e-containerd-1-7-dra pull-kubernetes-kind-dra

I am feeling lucky... :grin:

However, I already some tests locally. There is one known, odd (random?) failure:

    $ go test ./pkg/api/testing
    --- FAIL: TestDefaulting (1.76s)
        --- FAIL: TestDefaulting/resource.k8s.io/v1alpha3,_Kind=ResourceClaimList (0.01s)
            defaulting_test.go:238: expected resource.k8s.io/v1alpha3, Kind=ResourceClaimList to trigger defaulting due to fuzzing
    FAIL
    FAIL        k8s.io/kubernetes/pkg/api/testing       17.294s
    FAIL
    $ go test -run=TestDefaulting/resource.k8s.io/v1alpha3,_Kind=ResourceClaimList ./pkg/api/testing
    ok          k8s.io/kubernetes/pkg/api/testing       0.062s

pohly avatar Jul 11 '24 11:07 pohly

/test pull-kubernetes-integration pull-kubernetes-verify pull-kubernetes-verify-lint pull-kubernetes-unit pull-kubernetes-kind-dra

The E2E node jobs (pull-kubernetes-node-e2e-containerd-1-7-dra) are hard-coded to enable v1alpha2, so they are not usable here. We either need two versions of them or update them after merging.

pohly avatar Jul 11 '24 14:07 pohly

/test pull-kubernetes-integration

pohly avatar Jul 11 '24 15:07 pohly

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

k8s-triage-robot avatar Jul 11 '24 16:07 k8s-triage-robot

I addressed review feedback and linter hints. I gave up my stance that wrapping of errors (%v vs %w) should be considered on a case-by-cases basis and instead now always wrap.

Here's the diff before I rebase because of a conflict: https://github.com/kubernetes/kubernetes/compare/9a7aad84208bb330935ce376b37f2e47cb8e2565..18417425dc31639b92c9d35343bb18b29c6a7f53

pohly avatar Jul 12 '24 09:07 pohly

/test pull-kubernetes-node-e2e-crio-cgrpv1-dra

Job updated, might work again.

pohly avatar Jul 12 '24 13:07 pohly

Do we think we want per-slice attribs in 1.31?

I would say "yes", because it will usefully compact things, and can be ignored by the driver author if they don't need it. Should we do Capacity too?

This will cause one problem though. Since Name got bumped up out of BasicDevice, all there is in BasicDevice are attributes and capacities. If a device doesn't have any per-device attribute or capacity, except for the name, then BasicDevice will be empty. Maybe if we just don't do omitempty, and we are ok? Or I guess we would just force them to put some attribute in BasicDevice, even if it is redundant?

johnbelamaric avatar Jul 13 '24 00:07 johnbelamaric

Empty structs are okay. We have a few places where those are possible and as far as I know all of the encodings and all of the clients that we have seen are okay with it

thockin avatar Jul 13 '24 03:07 thockin

/test pull-kubernetes-node-e2e-crio-cgrpv1-dra

pohly avatar Jul 13 '24 07:07 pohly

PR updates:

  • review feedback, mostly around validation, and some test cleanups: https://github.com/kubernetes/kubernetes/compare/d7d817951bcca3ee8ae534a18ab9a578cd05e93d..ade5cd5e0b61e363656d462d092dd6e21d48d5ec (if the link becomes stale, this is the diff between dra-1.31-2024-07-15-II and dra-1.31-2024-07-16-I in my repo)
  • rebased because of a conflict
  • added DRAControlPlaneController as another commit (touches several different areas, could get squashed if that's preferred)

pohly avatar Jul 16 '24 12:07 pohly

added DRAControlPlaneController

We also now have "pull-kubernetes-kind-dra" for testing with only structured parameters (= what might go beta in 1.32) and "pull-kubernetes-kind-classic-dra" (adds that new DRAControlPlaneController feature gate).

I'm aware that those E2E tests have issues in the CI that I cannot reproduce locally. Might be timing issues. I keep looking...

pohly avatar Jul 16 '24 13:07 pohly

Another force-push of squashed changes: https://github.com/kubernetes/kubernetes/compare/83c8688f9da030edf5cc10135def0ec70c2056f2..f18fc36783a298d4ef6c07a00f2c93bb5297b577 (dra-1.31-2024-07-16-III -> dra-1.31-2024-07-17-I).

Contains https://github.com/kubernetes/kubernetes/pull/126156 as base and other updates in response to review feedback.

I have to rebase before tests can be run again.

pohly avatar Jul 17 '24 14:07 pohly

As far as I am concerned API is LGTM.

thockin avatar Jul 17 '24 17:07 thockin

Rebased after https://github.com/kubernetes/kubernetes/pull/125163 and https://github.com/kubernetes/kubernetes/pull/126156 were merged. Only 14 commits left :smile:

pohly avatar Jul 19 '24 08:07 pohly

Changing timeouts in https://github.com/kubernetes/kubernetes/pull/125488/commits/13c74c9db277359fc346a4bc2773addded17d59d may have helped stabilize the E2E node tests. Let's try again...

/test pull-kubernetes-node-e2e-containerd-1-7-dra pull-kubernetes-node-e2e-crio-cgrpv1-dra pull-kubernetes-node-e2e-crio-cgrpv2-dra

pohly avatar Jul 19 '24 18:07 pohly

Hi all, I will be AFK the last day before code freeze, I am approving and holding now, so I don't forget. Unless something has changed API wise, I am good with it.

/approve /hold

thockin avatar Jul 22 '24 04:07 thockin

Left some minor comments on the kubelet side of things. Besides those LGTM from the kubelet side.

klueska avatar Jul 22 '24 11:07 klueska

/test pull-kubernetes-node-e2e-crio-cgrpv2-dra

Setup failed with a flake (? "Error while dialing: dial unix /var/run/crio/crio.sock: connect: no such file or directory").

https://github.com/kubernetes/kubernetes/pull/125488#pullrequestreview-2192093009:

However, scheduler_perf is not included in a PR's presubmit-CI

It is enabled, and there is one test scenario ("SchedulingWithResourceClaimTemplateStructured") for the new API. More can and will be added once we start investigating performance a bit more. Right now, it's more about functionality.

@klueska: would you do us the honor and LGTM together with lifting the hold?

We have positive reviews and approval from (not collecting links, but they are there...):

  • @thockin (API)
  • @johnbelamaric (scheduling logic, diverse aspects)
  • @Huang-Wei, @kerthcet (SIG Scheduling)
  • @liggitt (SIG Auth, api-machinery)
  • @klueska (SIG Node)
  • and several others

Thanks everyone!

pohly avatar Jul 22 '24 17:07 pohly

Gladly. Thanks everyone! One step closer...

/approve /lgtm /unhold

klueska avatar Jul 22 '24 17:07 klueska

LGTM label has been added.

Git tree hash: ce1d7771db93f922ecda8bf61041a6be1c84de6d

k8s-ci-robot avatar Jul 22 '24 17:07 k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei, klueska, pohly, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jul 22 '24 17:07 k8s-ci-robot