enhancements icon indicating copy to clipboard operation
enhancements copied to clipboard

KEP-4622: New TopologyManager Policy: max-allowable-numa-nodes

Open cyclinder opened this issue 9 months ago • 3 comments

  • One-line PR description: New TopologyManager Policy: max-allowable-numa-nodes
  • Issue link: https://github.com/kubernetes/enhancements/issues/4622
  • Other comments: https://github.com/kubernetes/kubernetes/pull/124148

cyclinder avatar May 08 '24 08:05 cyclinder

Hi @klueska @ffromani, Here is a draft KEP, I would appreciate it if you could review it for me!

cyclinder avatar May 08 '24 08:05 cyclinder

Hi @klueska, Do you have a few comments on these files?

cyclinder avatar May 10 '24 10:05 cyclinder

I've added this to the tracking sheet for 1.31: https://docs.google.com/document/d/1U10J0WwgWXkdYrqWGGvO8iH2HKeerQAlygnqgDgWv4E/edit

@ffromani please let me know when this is ready for me to review

klueska avatar May 16 '24 11:05 klueska

Hi @ffromani, Could you have time to review the latest changes? thanks.

cyclinder avatar Jun 05 '24 03:06 cyclinder

From a technical standpoint, I think this KEP is now ready. We just need to tighten up the answers to the PRR questionaire.

/approve

klueska avatar Jun 12 '24 09:06 klueska

If I have 1000 clusters (or 10s of thousands!), a metric to correlate failures is WAY better than digging into individual node config options on individual clusters.

johnbelamaric avatar Jun 12 '24 10:06 johnbelamaric

@johnbelamaric so a per-node metric where each kubelet advertises the value of this new option on its /metrics endpoint?

klueska avatar Jun 12 '24 11:06 klueska

Maybe? I’ll let Joe chime in, he is closer to the details. Do we have an admission duration metric already? We can already tell if the feature gate is enabled based on a pre-existing metric. But that doesn’t help post-GA.

johnbelamaric avatar Jun 12 '24 11:06 johnbelamaric

/approve For PRR, with the agreement that we incorporate what was discussed here into the KEP: https://github.com/kubernetes/enhancements/pull/4624#discussion_r1638657947. (It's after working hours for @klueska and @cyclinder and I don't want this to miss enhancements freeze for something that we're mostly in agreement on already)

jpbetz avatar Jun 13 '24 19:06 jpbetz

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cyclinder, jpbetz, klueska, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jun 13 '24 19:06 k8s-ci-robot

/lgtm

I’ll work tomorrow with @cyclinder to get the text updated in a new PR.

klueska avatar Jun 13 '24 20:06 klueska

Doh. It looks like it needs a rebase now that the bug has been fixed that was spell checking his username.

klueska avatar Jun 13 '24 20:06 klueska

/retest

johnbelamaric avatar Jun 13 '24 20:06 johnbelamaric

Just needs a TOC update

johnbelamaric avatar Jun 13 '24 20:06 johnbelamaric

Thank you guys for the review ❤️! I've rebased https://github.com/kubernetes/enhancements/pull/4720, and now this PR is time to merge.

cyclinder avatar Jun 14 '24 02:06 cyclinder

/lgtm

reviewed changes since last lgtm and it's a trivial rebase + toc update

ffromani avatar Jun 14 '24 05:06 ffromani

Since this PR was delayed anyway I was hoping we could get the text updated as @jpbetz suggested before merging this.

Now that it's merged though, @cyclinder can you open a new PR to add some extra text around what @jpbetz suggested here: https://github.com/kubernetes/enhancements/pull/4624#discussion_r1638757893

klueska avatar Jun 14 '24 06:06 klueska

thanks @klueska sure, I've updated some text to follow your advice(https://github.com/kubernetes/enhancements/pull/4624#discussion_r1638657947), and for https://github.com/kubernetes/enhancements/pull/4624#discussion_r1638757893, https://github.com/kubernetes/enhancements/pull/4624#discussion_r1639134146 are some response, I have no idea what about a soft limit, I think it will just increase the startup time of the pod, which can't be considered a RISK seriously, right? maybe we should figure out the linear relationship between the number of NUMANodes of a node affecting the startup time of a pod before GA.

cyclinder avatar Jun 14 '24 10:06 cyclinder