
QoS-class resources

Open · marquiz opened this issue 3 years ago • 48 comments

Enhancement Description

  • One-line enhancement description (can be used as a release note): Add QoS-class resources to Kubernetes
  • Kubernetes Enhancement Proposal: #3004
  • Discussion Link: https://groups.google.com/g/kubernetes-sig-node/c/UoxYzZ7gCbg
  • Primary contact (assignee): @marquiz
  • Responsible SIGs: SIG-Node
  • Enhancement target (which target equals to which milestone):
    • Alpha release target (x.y): 1.29
    • Beta release target (x.y):
    • Stable release target (x.y):
  • [ ] Alpha
    • [ ] KEP (k/enhancements) update PR(s): https://github.com/kubernetes/enhancements/pull/3004
    • [ ] Code (k/k) update PR(s):
    • [ ] Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

marquiz avatar Oct 11 '21 06:10 marquiz

Thanks for opening this @marquiz !! :smile:

To ensure that the SIG is aware of this KEP and that communication regarding it has begun, please add the mandatory Discussion Link to the Description above. For reference, it is a "link to SIG mailing list thread, meeting, or recording where the Enhancement was discussed before KEP creation".

kikisdeliveryservice avatar Oct 11 '21 18:10 kikisdeliveryservice

@kikisdeliveryservice the topic was discussed on SIG-Node on 2021-10-19. Meeting minutes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg

kad avatar Oct 28 '21 14:10 kad

/sig node

pacoxu avatar Jan 06 '22 10:01 pacoxu

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 31 '22 08:05 k8s-triage-robot

/remove-lifecycle stale

marquiz avatar Jun 06 '22 14:06 marquiz

/milestone v1.25

Priyankasaggu11929 avatar Jun 10 '22 02:06 Priyankasaggu11929

Hello @marquiz :wave:, 1.25 Enhancements team here!

Just checking in as we approach the enhancements freeze at 18:00 PST on Thursday June 16, 2022. For note, this enhancement is targeting stage alpha for the 1.25 release.

Here’s where this enhancement currently stands:

  • [ ] KEP file using the latest template has been merged into the k/enhancements repo.
  • [ ] KEP status is marked as implementable
  • [ ] KEP has an updated detailed test plan section filled out
  • [ ] KEP has up to date graduation criteria
  • [ ] KEP has a production readiness review that has been completed and merged into k/enhancements.

It looks like for this one, we would need to:

  • Update most of the KEP content in the open pull request to match the latest KEP template
  • Expand the Test Plan section with a more detailed test plan
  • Fill out the Graduation Criteria section in the KEP with proper metadata
  • Add a dedicated Design Details section in the KEP and move the Test Plan and Graduation Criteria sub-sections under it
  • Create a kep.yaml file reflecting the latest milestone and stage information. Here is an example for reference.
  • Create a production readiness review file stating the KEP issue number, the stage you are planning for this release cycle (in this case, alpha), and the approver. Here is an example for reference.

Open PR https://github.com/kubernetes/enhancements/pull/3004 addressing ^

For note, the status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

Atharva-Shinde avatar Jun 10 '22 21:06 Atharva-Shinde

/stage alpha

Atharva-Shinde avatar Jun 13 '22 11:06 Atharva-Shinde

Hello @marquiz πŸ‘‹, just a quick check-in again.

The enhancements freeze for 1.25 starts this Thursday, June 16, 2022 at 18:00 PT.

Please try to get the above mentioned action-items done before enhancements freeze :)

Note: the current status of the enhancement is still marked at-risk.

Atharva-Shinde avatar Jun 13 '22 14:06 Atharva-Shinde

Thanks @Atharva-Shinde for the help!

I now did the following updates:

  • synced with the latest KEP template
  • added graduation criteria
  • updated test plan
  • add kep.yaml
  • add (placeholder) production readiness review file

We'll review this in SIG-Node tomorrow so more updates after that.

marquiz avatar Jun 13 '22 16:06 marquiz

Hey @marquiz 👋 Good news! The Enhancements Freeze is now extended to next week, until Thursday June 23, 2022 🚀 So we now have one more week to submit the KEP :)

Atharva-Shinde avatar Jun 14 '22 15:06 Atharva-Shinde

Hello @marquiz πŸ‘‹, just a quick check-in again, as we approach the 1.25 enhancements freeze.

Please plan to get the open PR https://github.com/kubernetes/enhancements/pull/3004 merged before the enhancements freeze on Thursday, June 23, 2022 at 18:00 PT, which is just over 3 days away from now.

For note, the current status of the enhancement is at-risk. Thank you!

Priyankasaggu11929 avatar Jun 21 '22 05:06 Priyankasaggu11929

Hello, 1.25 Enhancements Lead here πŸ‘‹. With Enhancements Freeze now in effect, this enhancement has not met the criteria for the freeze and has been removed from the milestone.

As a reminder, the criteria for enhancements freeze are:

  • KEP file using the latest template has been merged into the k/enhancements repo, with up to date latest milestone and stage
  • KEP status is marked as implementable
  • KEP has an updated detailed test plan section filled out
  • KEP has up to date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

Feel free to file an exception to add this back to the release. If you plan to do so, please file this as early as possible.

Thanks!

/milestone clear

Priyankasaggu11929 avatar Jun 24 '22 01:06 Priyankasaggu11929

Hi @Atharva-Shinde @Priyankasaggu11929, I've retitled the PR (#3004) in order to reduce confusion with respect to some other KEPs and earlier work. Is it ok to retitle this issue as well?

marquiz avatar Jul 08 '22 14:07 marquiz

Hello @marquiz, retitling the issue is perfectly fine. Thank you! :)

Priyankasaggu11929 avatar Jul 08 '22 14:07 Priyankasaggu11929

/retitle QoS-class resources

marquiz avatar Jul 08 '22 15:07 marquiz

I have a query: how is this different from the built-in cpu resource? Linux blockio lets you configure controls such as blkio.throttle.read_bps_device and similarly, for CPU you can define requests and limits.

If the blockio case is like the existing cpu approach, then I'm wary of permanently complicating the Kubernetes Pod API to support a particular, vendor-specific technology.

If we want to let different Pods share resources, we should aim to make a much more generic mechanism. For example, allow two different Pods in the same namespace to aggregate their cpu limit, agreeing between those two Pods to co-operate if they are scheduled onto the same node. Once we can share cpu limits, we can look at extending that sharing to other kinds of resource such as an extended resource.

At the very least, I'd like to see the sort of thing I'm proposing clearly called out as an alternative in the KEP, before we merge it.

sftim avatar Jul 08 '22 15:07 sftim

Hi @sftim, thanks for the review!

I have a query: how is this different from the built-in cpu resource? Linux blockio lets you configure controls such as blkio.throttle.read_bps_device and similarly, for CPU you can define requests and limits.

Blkio is just one possible use for this. At least one fundamental difference between blkio and cpu is that the "amount of blkio" is not (ac)countable in any meaningful way. For cpu we know how much there is, and there are meaningful controls to allocate a portion of it. Blkio is more about throttling: there are potentially a multitude of devices, it is hard to predict which ones a pod will actually use, and the different storage devices potentially all have different characteristics (parameters); think of SSDs vs. rotational drives, etc.
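[Editorial note: as a hedged illustration of the countability point marquiz makes here, the sketch below shows why cpu, unlike blkio, can be treated as a countable resource: a scheduler can sum per-pod requests against a node's allocatable capacity. The capacity and request numbers are hypothetical, in millicores; this is not code from the KEP.]

```python
# Hypothetical sketch: cpu is "countable" because a scheduler can sum
# per-pod requests against a node's allocatable capacity and decide
# whether one more pod fits. No such total exists for blkio throttling.
NODE_ALLOCATABLE_MILLI_CPU = 4000  # illustrative node capacity (millicores)

def fits(existing_requests: list[int], new_request: int) -> bool:
    """True if a pod requesting `new_request` millicores fits on the node."""
    return sum(existing_requests) + new_request <= NODE_ALLOCATABLE_MILLI_CPU
```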

If the blockio case is like the existing cpu approach, then I'm wary of permanently complicated the Kubernetes Pod API to support a particular, vendor specific technology.

There isn't anything vendor-specific in this proposal. One example is an Intel technology, but even that is based on a generic interface in the Linux kernel (resctrlfs) that other vendors' corresponding technologies also use.

If we want to let different Pods share resources, we should aim to make a much more generic mechanism. For example, allow two different Pods in the same namespace to aggregate their cpu limit, agreeing between those two Pods to co-operate if they are scheduled onto the same node. Once we can share cpu limits, we can look at extending that sharing to other kinds of resource such as an extended resource.

I wouldn't describe this as a resource-sharing mechanism between pods. Yes, in some cases they might end up using the same resource, but generally that is not the case. In the case of blockio, the class would just specify the throttling/weight parameters for storage devices; it doesn't say anything about which particular devices a pod uses. Similarly for RDT, the class might determine what portion of cache the pod can use or how much memory bandwidth it can use, but it doesn't say anything about which CPUs the pod is running on (i.e. which cache IDs it is using). In these cases, two pods belonging to the same class generally means that they have the "same level of throttling".
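[Editorial note: a hedged sketch of the "class = same level of throttling" idea described here. The class names, device number, and byte values are hypothetical, chosen for illustration only; they are not from the KEP.]

```python
# Hypothetical sketch: a QoS class names a set of per-device throttling
# parameters. It carries no countable quantity and says nothing about
# which devices a pod actually ends up using.
BLOCKIO_CLASSES = {
    # class name -> illustrative throttle settings (bytes per second)
    "throttled": {"read_bps": 10 * 1024**2, "write_bps": 5 * 1024**2},
    "unthrottled": {},  # no limits written for this class
}

def throttle_lines(class_name: str, device: str = "8:0") -> list[str]:
    """Render the "major:minor bytes_per_sec" lines a runtime could write
    into the cgroup blkio.throttle.*_bps_device files for this class."""
    params = BLOCKIO_CLASSES[class_name]
    return [f"{device} {bps}"
            for bps in (params.get("read_bps"), params.get("write_bps"))
            if bps is not None]
```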

At the very least, I'd like to see the sort of thing I'm proposing clearly called out as an alternative in the KEP, before we merge it.

At least for now I think they are two different things.

marquiz avatar Jul 08 '22 17:07 marquiz

/milestone v1.26
/label lead-opted-in

(I'm doing this on behalf of @ruiwen-zhao / SIG-Node)

marosset avatar Sep 30 '22 18:09 marosset

To clarify why I think blkio is vendor-specific: only Linux nodes have this resource. Windows nodes have CPU and memory but they don't have blkio or a direct equivalent.

I'd like the KEP to make the difference clear to a reader who knows Kubernetes but isn't particularly familiar with any of the QoS mechanisms that we propose to integrate with.

sftim avatar Oct 01 '22 19:10 sftim

Hey @marquiz 👋, 1.26 Enhancements team here!

Just checking in as we approach Enhancements Freeze at 18:00 PDT on Thursday 6th October 2022.

This enhancement is targeting stage alpha for the 1.26 release.

Here's where this enhancement currently stands:

  • [X] KEP file using the latest template has been merged into the k/enhancements repo.
  • [ ] KEP status is marked as implementable
  • [X] KEP has an updated detailed test plan section filled out
  • [X] KEP has up to date graduation criteria
  • [ ] KEP has a production readiness review that has been completed and merged into k/enhancements.

For this KEP, we would need to:

  • Change the status of the KEP from provisional to implementable and add reviewers/approvers
  • Get this PR #3004 merged before Enhancements Freeze to make this enhancement eligible for 1.26 release.

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you :)

Atharva-Shinde avatar Oct 02 '22 12:10 Atharva-Shinde

Hello @marquiz πŸ‘‹, just a quick check-in again, as we approach the 1.26 Enhancements freeze.

Please plan to get the action items mentioned in my comment above done before Enhancements freeze at 18:00 PDT on Thursday 6th October 2022, i.e. tomorrow.

For note, the current status of the enhancement is marked at-risk :)

Atharva-Shinde avatar Oct 05 '22 16:10 Atharva-Shinde

Hello 👋, 1.26 Enhancements Lead here.

Unfortunately, this enhancement did not meet requirements for enhancements freeze.

If you still wish to progress this enhancement in v1.26, please file an exception request. Thanks!

/milestone clear
/label tracked/no
/remove-label tracked/yes
/remove-label lead-opted-in

rhockenbury avatar Oct 07 '22 01:10 rhockenbury

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 05 '23 02:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 04 '23 03:02 k8s-triage-robot

/remove-lifecycle rotten

kad avatar Feb 04 '23 09:02 kad

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 05 '23 10:05 k8s-triage-robot

/remove-lifecycle stale

The KEP is actively reviewed, and part of the 1.28 SIG-Node plan.

kad avatar May 05 '23 10:05 kad

/milestone v1.28

SergeyKanzhelev avatar May 05 '23 22:05 SergeyKanzhelev

/label lead-opted-in

SergeyKanzhelev avatar Jun 08 '23 07:06 SergeyKanzhelev