enhancements icon indicating copy to clipboard operation
enhancements copied to clipboard

Pass down resources to CRI

Open marquiz opened this issue 2 years ago • 23 comments

  • One-line PR description: KEP for extending the CRI API to pass down unmodified resource information from the kubelet to the CRI runtime.
  • Issue link: #4112
  • Other comments:

marquiz avatar Jun 28 '23 14:06 marquiz

/cc @haircommander @mikebrow @zvonkok @fidencio @kad

marquiz avatar Jun 28 '23 14:06 marquiz

@marquiz: GitHub didn't allow me to request PR reviews from the following users: fidencio.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @haircommander @mikebrow @zvonkok @fidencio @kad

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 28 '23 14:06 k8s-ci-robot

/retitle Pass down resources to CRI

marquiz avatar Jul 18 '23 07:07 marquiz

@marquiz We need to check how this will work with DRA and CDI devices. If we have enough information to know which devices need to be added to the sandbox just by the resource claim name.

zvonkok avatar Aug 02 '23 12:08 zvonkok

@marquiz There is already some code for sandbox sizing, accumulation of resources CPU and Memory for reference: https://github.com/kubernetes/kubernetes/pull/104886 that we leverage in Kata, what are the plans for this interface, deprecate or keep it?

zvonkok avatar Aug 02 '23 12:08 zvonkok

@bergwolf @egernst FYI

zvonkok avatar Aug 02 '23 12:08 zvonkok

Another point to consider is how we're going to integrate or not these enhancements with the new containerd Sandbox API.

zvonkok avatar Aug 03 '23 09:08 zvonkok

There is already some code for sandbox sizing, accumulation of resources CPU and Memory for reference: kubernetes/kubernetes#104886 that we leverage in Kata, what are the plans for this interface, deprecate or keep it?

@zvonkok that one is just the native resources and gives the resources in the "obfuscated" form i.e. not telling the actual reqeusts/limits (plus it's for Linux resources only). I think we wouldn't, or even couldn't, touch this, i.e. keep it.

marquiz avatar Aug 03 '23 13:08 marquiz

@moshe010 @adrianchiris @shivamerla @cdesiniotis FYI

zvonkok avatar Aug 03 '23 14:08 zvonkok

Since the DevicePlugin API supports CDI devices with this KEP: https://github.com/kubernetes/enhancements/pull/4011 we should try to add more restrictions and requirements how we want to design this passthrough interface. @marquiz FYI

zvonkok avatar Sep 01 '23 07:09 zvonkok

Just for reference linking this old KEP here: https://github.com/kubernetes/enhancements/pull/3080, looks like there were some comments that CDI may not be the complete solution to some use-cases. We do not want to break anything by relying only on CDI when passing down the devices.

zvonkok avatar Sep 01 '23 07:09 zvonkok

I pushed a small-ish update, adding CDI devices and mounts to the protobuf definitions.

Just for reference linking this old KEP here: #3080, looks like there were some comments that CDI may not be the complete solution to some use-cases. We do not want to break anything by relying only on CDI when passing down the devices.

@zvonkok hmm, what exact comment are you referring to?

marquiz avatar Sep 21 '23 16:09 marquiz

@marquiz See the comment here https://github.com/kubernetes/enhancements/pull/3080#issuecomment-1586061768

zvonkok avatar Sep 21 '23 16:09 zvonkok

/cc

swatisehgal avatar Jan 23 '24 18:01 swatisehgal

KEP updated, in response to review comments from @elezar

  • upated user story 3 about customized resource management
  • updated example pod spec vs RunPodSandboxRequest proto
  • added a note about device plugin resources

marquiz avatar Jan 26 '24 10:01 marquiz

/stage alpha /milestone v1.30

SergeyKanzhelev avatar Jan 26 '24 23:01 SergeyKanzhelev

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: marquiz Once this PR has been reviewed and has the lgtm label, please assign dchen1107, jpbetz for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Feb 02 '24 10:02 k8s-ci-robot

PR updated:

  • update user story 2
  • re-order sections about CRI API changes (describe RunPodSandbox before the container specific requests)
  • update the desciption of PodSandboxConfig
  • add a note to CreateContainer request that container creation may fail if resources do not match with PodSandbox request.
  • fill in test plan
  • fill in graduation criteria
  • fill in upgrade / downgrade strategy
  • fill in version skew strategy

marquiz avatar Feb 02 '24 11:02 marquiz

Why not have the Kubelet create a pod-level aggregated view of the resources? Similar to what is already done with the sandbox annotations, but without translating to the platform-specific types?

+1 to this. Getting an aggregated view at the pod level is really what this KEP is for, and it's easier to discern how the fields are meant to be used.

yujuhong avatar Feb 05 '24 22:02 yujuhong

Thanks @tallclair for the review

I'm concerned that this KEP as is forces the container runtime to reimplement too much of the container lifecycle, which in the best case puts a burden on CRI implementation maintainers, and in the worst case could slow down future Kubernetes feature development.

The goal is to not require any changes for existing CRI implementations. The CRI runtime can omit the data if it doesn't need/want to pre-allocate/pre-optimize resources for the pod. The idea is enable a kinda "forward lookup" into the future for those who need it. This probably needs to be better communicated in the proposal (and the API, with comments and naming). Thoughts?

For example, the proposed API separates out init containers & regular containers, but sidecar containers blur those lines. Now, calculating the maximum resource requirements for the pod involves accounting for sidecar containers: https://github.com/kubernetes/kubernetes/blob/4a4f5dbc079e85e63f62178af962cb65bd60d987/pkg/api/v1/resource/helpers.go#L50. I don't think we should treat this as a 1-off change.

This is a very valid point. The proposal was now changed: instead of separate lists for init and regular containers, it now has one list that contains all containers, each element in the list including the type of container (init, sidecar or regular).

Why not have the Kubelet create a pod-level aggregated view of the resources? Similar to what is already done with the sandbox annotations, but without translating to the platform-specific types?

I believe that will cause gray hairs/problems in some scenarios, e.g. in VM sizing and CoCo. For example, how would you aggregate resource limits? Also, you could make better decisions in case that an init container requests a lot of resources wrt. the regular containers. In CoCo knowing exactly what resources each container needs helps implementing the principle of least privileges/smaller attach surface (no sharing of unnecessary mounts between containers for example).

Ref e.g.: https://github.com/kubernetes/enhancements/pull/4113#discussion_r1484275415

marquiz avatar Apr 09 '24 13:04 marquiz

I pushed an update last week but forgot to leave a comment:

  • sidecar containers: instead of separate lists for init and regular containers, have one list and include the type of container (init, sidecar, regular)
  • add notes about mounts and devices where describing changes to CreateContainer and UpdateContainerResources requests
  • update description of kubelet changes: more accurate description of what information is included in each CRI request
  • fix typos
  • kep.yaml: updated milestone to v1.31

marquiz avatar Apr 16 '24 14:04 marquiz

/assign

tallclair avatar Apr 23 '24 17:04 tallclair

For example, the proposed API separates out init containers & regular containers, but sidecar containers blur those lines. Now, calculating the maximum resource requirements for the pod involves accounting for sidecar containers: https://github.com/kubernetes/kubernetes/blob/4a4f5dbc079e85e63f62178af962cb65bd60d987/pkg/api/v1/resource/helpers.go#L50. I don't think we should treat this as a 1-off change.

This is a very valid point. The proposal was now changed: instead of separate lists for init and regular containers, it now has one list that contains all containers, each element in the list including the type of container (init, sidecar or regular).

While sidecars did need to be addressed by the original proposal, it misses the big picture I was trying to raise here: exposing this pod lifecycle information into the container runtime will create friction for future k8s changes with pod lifecycle implications. Now any pod lifecycle change is potentially a breaking change to the runtime, so we need to manage runtime version skew in a way we didn't before. This is why I prefer to keep as much of the lifecycle logic in the Kubelet as we can.

tallclair avatar May 02 '24 23:05 tallclair

For example, the proposed API separates out init containers & regular containers, but sidecar containers blur those lines. Now, calculating the maximum resource requirements for the pod involves accounting for sidecar containers: https://github.com/kubernetes/kubernetes/blob/4a4f5dbc079e85e63f62178af962cb65bd60d987/pkg/api/v1/resource/helpers.go#L50. I don't think we should treat this as a 1-off change.

This is a very valid point. The proposal was now changed: instead of separate lists for init and regular containers, it now has one list that contains all containers, each element in the list including the type of container (init, sidecar or regular).

While sidecars did need to be addressed by the original proposal, it misses the big picture I was trying to raise here: exposing this pod lifecycle information into the container runtime will create friction for future k8s changes with pod lifecycle implications. Now any pod lifecycle change is potentially a breaking change to the runtime, so we need to manage runtime version skew in a way we didn't before. This is why I prefer to keep as much of the lifecycle logic in the Kubelet as we can.

Interesting, from the other side of the grpc fence.. in assisting with the CRI Server integration side for the runtime and image services, esp. for physical podsandbox level lifecycle operations, it seems here, at least to me, that the lack of podsandbox lifecycle information exposure has been a cause of friction for needed collaborative k8s/container runtime changes. Agree, a more formal podsandbox definition with additional content expected, or not, to be utilized by the CRI container runtime, runtime handlers, runtime shims, and now NRI plugins, would result in version skew exposure on that typed data. Two sides here both right, let's collaborate :-)

mikebrow avatar Jun 11 '24 17:06 mikebrow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: marquiz Once this PR has been reviewed and has the lgtm label, please assign dchen1107, wojtek-t for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Sep 03 '24 15:09 k8s-ci-robot

An overarching theme we're seeing in 1.32 is informing the CRI about changes that happen in the kubelet more (in place VPA, pod level resources, and this) I think the CRI changes for each of these should be synced as there's a decent amount of overlap @tallclair @ndixita @marquiz . Maybe a collaborative branch of the cri changes can be made after enhancements freeze? What do folks think?

haircommander avatar Oct 01 '24 19:10 haircommander

Updated:

  • set johnbelamaric as approver in the prod-readiness yaml
  • update kep.yaml:
    • set mikebrow as reviewer, haircommander as approver
    • set feature gate name: KubeletContainerResourcesInPodSandbox
  • add an outline of NRI-based e2e tests for beta (thanks: haircommander)
  • updated GA graduation criteria (dropped vague "N installs" etc)
  • updated "How can this feature be enabled/disabled" section

Left to-do:

  • sync with pod level limits KEP
  • sync with in-place-updates KEP (beta)
  • write a summary of previous comments (as a comment in this PR)

marquiz avatar Oct 03 '24 21:10 marquiz