
kep-4622: promote topologyManager policy: max-allowable-numa-nodes to GA

cyclinder opened this pull request 10 months ago • 12 comments

  • One-line PR description: promote topologyManager policy: max-allowable-numa-nodes to GA (1.33); a configuration sketch follows this list
  • Issue link: https://github.com/kubernetes/enhancements/issues/4622
  • Other comments:
    • it was created in 1.31 as a beta feature: https://github.com/kubernetes/enhancements/pull/4624
    • The GA graduation would mean:
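
For context, max-allowable-numa-nodes is set through the kubelet's topologyManagerPolicyOptions map and lifts the topology manager's long-standing limit of 8 NUMA nodes. The sketch below is only an illustration and is not taken from this KEP or PR: it builds a KubeletConfiguration with the option enabled (using 16 as an example value) and prints the equivalent kubelet config YAML.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Minimal kubelet configuration enabling the max-allowable-numa-nodes
	// policy option. "16" is an example value, not a recommendation; pick
	// it to match the NUMA topology of the node.
	cfg := kubeletv1beta1.KubeletConfiguration{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "kubelet.config.k8s.io/v1beta1",
			Kind:       "KubeletConfiguration",
		},
		TopologyManagerPolicy: "single-numa-node",
		TopologyManagerPolicyOptions: map[string]string{
			"max-allowable-numa-nodes": "16",
		},
	}

	// Print the configuration as YAML.
	out, err := yaml.Marshal(&cfg)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```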

cyclinder · Feb 12 '25

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cyclinder. Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot · Feb 12 '25

I am supportive of graduating features to GA but not sure if we will get PRR in time as this came in late.

mrunalp · Feb 13 '25

> I am supportive of graduating features to GA but not sure if we will get PRR in time as this came in late.

Same. I'll still be helping here with the reviews and cleaning up the feature, but if we miss the deadline (as unfortunately seems likely) it's not too bad: we can prepare and be ready for an early 1.34 merge.

ffromani · Feb 13 '25

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · May 14 '25

/remove-lifecycle stale

ffromani · May 14 '25

/retitle kep-4622: promote topologyManager policy: max-allowable-numa-nodes to GA

ffromani · May 14 '25

> I can only imagine extending that metric as a possible improvement.

Agreed. Do you mean that we need another metric for this option? We already have an existing metric, topology_manager_admission_duration_ms.
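
For readers outside the thread, topology_manager_admission_duration_ms is a kubelet histogram that records how long topology-aware pod admission takes. Below is a rough, self-contained sketch of the kind of measurement it represents, using prometheus/client_golang directly rather than the kubelet's actual metrics wiring; the bucket layout is assumed, not the kubelet's.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Illustrative histogram mirroring the idea behind the kubelet's
	// topology_manager_admission_duration_ms metric: record how long each
	// pod admission decision takes, in milliseconds.
	admissionDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "topology_manager_admission_duration_ms",
		Help:    "Duration of topology manager pod admission in milliseconds (illustrative).",
		Buckets: prometheus.ExponentialBuckets(1, 2, 10), // assumed buckets: 1ms .. ~512ms
	})

	start := time.Now()
	// ... admission logic would run here ...
	elapsed := time.Since(start)
	admissionDuration.Observe(float64(elapsed.Milliseconds()))

	fmt.Printf("observed admission duration: %v\n", elapsed)
}
```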

> Do we already have tests in the codebase to ensure that this policy option works as expected and is compatible with other Topology Manager policy options, or are they planned as part of GA graduation?

We already have an e2e test for this in https://github.com/kubernetes/kubernetes/pull/124148; see https://github.com/kubernetes/kubernetes/blob/0154f8a222c38d5808f1aecf218379cedf15e8a7/test/e2e_node/topology_manager_test.go#L257

cyclinder · Jun 16 '25

> I think we can have some e2e tests which happen to set this value and then run a subset of the existing topology manager tests. This should be easy, cheap enough and will give us enough coverage. I can't say if this warrants a beta2 or can be done in the context of the GA graduation.

I am open to adding some other e2e tests. Do we need to do this in 1.34 or in 1.35?

cyclinder · Jun 19 '25

> I think we can have some e2e tests which happen to set this value and then run a subset of the existing topology manager tests. This should be easy, cheap enough and will give us enough coverage. I can't say if this warrants a beta2 or can be done in the context of the GA graduation.

> I am open to adding some other e2e tests. Do we need to do this in 1.34 or in 1.35?

From what I've gathered, it's fair and accepted to add e2e tests in the same cycle in which we graduate to beta, but everything should be in place before we move to GA.

ffromani · Jun 20 '25

It looks like the lead-opt-in label was missing, so this PR didn't get a PRR review? Let's identify all the gaps from the sig-node perspective, work on them, and aim for an early, trivial GA move next cycle.

ffromani · Jun 20 '25

> According to https://github.com/kubernetes/enhancements/pull/5242, my understanding is that we need another beta to complete the missing e2e tests. Other than that, LGTM. Not adding the label because this PR wants to move to GA, and it turns out we need a beta2 to address the gaps.

According to my understanding, I need to submit another PR, similar to https://github.com/kubernetes/enhancements/pull/5242, and also add an e2e test to Kubernetes. Only after these are completed can this PR receive the required label and be merged, right?

cyclinder · Jun 27 '25

> I think we can have some e2e tests which happen to set this value and then run a subset of the existing topology manager tests

We already have an e2e test for this in https://github.com/kubernetes/kubernetes/pull/124148; see https://github.com/kubernetes/kubernetes/blob/0154f8a222c38d5808f1aecf218379cedf15e8a7/test/e2e_node/topology_manager_test.go#L1449. This is the most basic verification that the changed setting does not break anything at maxNUMANodes(16). Do we still need an additional e2e test?
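
As background for why exercising the suite at larger NUMA counts is interesting at all: the topology manager considers combinations of NUMA nodes when generating and merging hints, so the amount of work grows steeply with the node count that max-allowable-numa-nodes permits. A back-of-the-envelope sketch of that growth (not the kubelet's actual hint-merging code):

```go
package main

import "fmt"

// nonEmptySubsets returns 2^n - 1, the number of non-empty NUMA-node
// subsets that could in principle appear as candidate affinity masks on a
// machine with n NUMA nodes.
func nonEmptySubsets(n int) int {
	return (1 << n) - 1
}

func main() {
	// Compare the candidate space at the historical 8-node cap with the
	// value exercised by the e2e test mentioned above (16 NUMA nodes).
	for _, n := range []int{2, 4, 8, 16} {
		fmt.Printf("NUMA nodes: %2d  candidate affinity masks: %d\n", n, nonEmptySubsets(n))
	}
}
```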

cyclinder · Jun 27 '25

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Sep 25 '25

@haircommander I'm updating it (preserving @cyclinder's authorship, of course) in https://github.com/kubernetes/enhancements/pull/5627

ffromani · Oct 08 '25

Got it, thanks. I updated the description in https://github.com/kubernetes/enhancements/issues/4622 to reflect that.

haircommander · Oct 08 '25