enhancements
enhancements copied to clipboard
KEP-4622: New TopologyManager Policy: max-allowable-numa-nodes
- One-line PR description: New TopologyManager Policy: max-allowable-numa-nodes
- Issue link: https://github.com/kubernetes/enhancements/issues/4622
- Other comments: https://github.com/kubernetes/kubernetes/pull/124148
Hi @klueska @ffromani, Here is a draft KEP, I would appreciate it if you could review it for me!
Hi @klueska, Do you have a few comments on these files?
I've added this to the tracking sheet for 1.31: https://docs.google.com/document/d/1U10J0WwgWXkdYrqWGGvO8iH2HKeerQAlygnqgDgWv4E/edit
@ffromani please let me know when this is ready for me to review
Hi @ffromani, Could you have time to review the latest changes? thanks.
From a technical standpoint, I think this KEP is now ready. We just need to tighten up the answers to the PRR questionaire.
/approve
If I have 1000 clusters (or 10s of thousands!), a metric to correlate failures is WAY better than digging into individual node config options on individual clusters.
@johnbelamaric so a per-node metric where each kubelet advertises the value of this new option on its /metrics endpoint?
Maybe? I’ll let Joe chime in, he is closer to the details. Do we have an admission duration metric already? We can already tell if the feature gate is enabled based on a pre-existing metric. But that doesn’t help post-GA.
/approve For PRR, with the agreement that we incorporate what was discussed here into the KEP: https://github.com/kubernetes/enhancements/pull/4624#discussion_r1638657947. (It's after working hours for @klueska and @cyclinder and I don't want this to miss enhancements freeze for something that we're mostly in agreement on already)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cyclinder, jpbetz, klueska, mrunalp
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~keps/prod-readiness/OWNERS~~ [jpbetz]
- ~~keps/sig-node/OWNERS~~ [mrunalp]
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment
/lgtm
I’ll work tomorrow with @cyclinder to get the text updated in a new PR.
Doh. It looks like it needs a rebase now that the bug has been fixed that was spell checking his username.
/retest
Just needs a TOC update
Thank you guys for the review ❤️! I've rebased https://github.com/kubernetes/enhancements/pull/4720, and now this PR is time to merge.
/lgtm
reviewed changes since last lgtm and it's a trivial rebase + toc update
Since this PR was delayed anyway I was hoping we could get the text updated as @jpbetz suggested before merging this.
Now that it's merged though, @cyclinder can you open a new PR to add some extra text around what @jpbetz suggested here: https://github.com/kubernetes/enhancements/pull/4624#discussion_r1638757893
thanks @klueska sure, I've updated some text to follow your advice(https://github.com/kubernetes/enhancements/pull/4624#discussion_r1638657947), and for https://github.com/kubernetes/enhancements/pull/4624#discussion_r1638757893, https://github.com/kubernetes/enhancements/pull/4624#discussion_r1639134146 are some response, I have no idea what about a soft limit, I think it will just increase the startup time of the pod, which can't be considered a RISK seriously, right? maybe we should figure out the linear relationship between the number of NUMANodes of a node affecting the startup time of a pod before GA.