node-feature-discovery nfd-topology-updater: Detect E/P cores and expose through attributes

trafficstars

Detect which CPUs are which types of the cores (P-cores or E-cores) and expose IDs through labels.

Jun 07 '24 09:06 ozhuraki

Hi @ozhuraki. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jun 07 '24 09:06 k8s-ci-robot

Deploy Preview for kubernetes-sigs-nfd ready!

Name	Link
Latest commit	2946157c6f7d8903307da737bc5a6c48172891be
Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-nfd/deploys/67599acefb726600082bb9a9
Deploy Preview	https://deploy-preview-1737--kubernetes-sigs-nfd.netlify.app
Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Jun 07 '24 09:06 netlify[bot]

I have several concerns regarding this PR:

way of detecting cores. We should use /sys/devices/cpu_*/*cpus for getting information about P/E cores. not something else.
Those subdirs cpu_* are present only on hybrid CPUs, so if we have only one subdir /sys/devices/cpu/, it means only one type of the cores exists in the package, thus no need to expose those attributes at all. In other words: labels should be exposed only if CPU is hybrid.
Even on hybrid CPUs, those labels should not be by default populated, only when configuration requests that. I agree with Markus that those labels are not directly consumable, so extra step to enable them on hybrid CPUs is fine.
Exposing it as attributes in topology exporter is fine. There we can put them without configuration option if running on hybrid CPUs.
Names of the types of the cores should not be hard-coded, but taken from sysfs. e.g.

$ grep "" /sys/devices/cpu_*/*cpus
/sys/devices/cpu_atom/cpus:16-19
/sys/devices/cpu_core/cpus:0-15
$

so, the labels should have "foo..atom" and "foo...core" names. @marquiz please suggest some good prefix instead of "foo..."

Jul 12 '24 07:07 kad

@marquiz @kad

Thanks for the helpful input. Updated, please take a look

Aug 19 '24 11:08 ozhuraki

@marquiz

Thanks for the useful input! Updated, please take a look

Nov 04 '24 18:11 ozhuraki

@marquiz @kad

Thanks for the help! Updated, please take a look

Nov 06 '24 15:11 ozhuraki

@marquiz

Thanks, updated, please take a look

Nov 08 '24 14:11 ozhuraki

@marquiz

Thanks, updated, please take a look

Nov 11 '24 12:11 ozhuraki

/retest

Nov 28 '24 13:11 marquiz

One question and sorry if I misunderstand how hybrid cpus can be consumed in K8s. Wouldn't be more useful to just modify ResourceAggregator to advertise new resources for each cpu type in each Zone (NUMA node) by mapping generic k8s cpu resource to each cpu_type? But I may be wrong as I don't know how currently those E/P cores are assigned and consumed in Kubernetes.

Nov 28 '24 14:11 PiotrProkop

@PiotrProkop

There's nothing special about it, just one way of exposing such information.

We can also modify ResourceAggregator to advertise new resources for each cpu type too.

I will make an issue and will try to add it with a sepatate PR.

Nov 29 '24 10:11 ozhuraki

I just don't know how this information provided via attributes can be useful to custom scheduler without mapping to already used cpus. NRT object only exposes info about allocated/free cpus in given Zone without an info which cpu id is used and which is free.

Nov 29 '24 14:11 PiotrProkop

@PiotrProkop

Thanks and sorry, missed you comment.

Yes, it's merely exposing an information for particuar kind of cores for workloads that just want particular kind of cores on hybrid cpus.

I made an issue #1964 to expose it through ResourceAggregator (so the custom scheduler could track allocated/free cores) and will try to add such PR too.

@uniemimu

Thanks, updated.

Dec 11 '24 11:12 ozhuraki

I still feel like this information should be then exposed via labels and just say that given node has e or p cores without the numbers. And then no special scheduler plugin is needed but I may be wrong on what's the usecase here. I'll ask @ffromani to chime in.

Dec 11 '24 11:12 PiotrProkop

@PiotrProkop

We initially had it through labels but later switched to annotations since labels were restrictive on exposing IDs.

I can additionaly expose via labels and just say that given node has E or P, as you suggest.

Dec 11 '24 11:12 ozhuraki

@PiotrProkop

Added labels too, please take a look.

Dec 11 '24 14:12 ozhuraki

@ozhuraki: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-node-feature-discovery-build-image-cross-generic	2946157c6f7d8903307da737bc5a6c48172891be	link	true	`/test pull-node-feature-discovery-build-image-cross-generic`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Dec 11 '24 16:12 k8s-ci-robot

/cc

Dec 16 '24 10:12 ffromani

got the ping, will review ASAP

Dec 16 '24 10:12 ffromani

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marquiz, ozhuraki, uniemimu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [marquiz]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Dec 19 '24 12:12 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Mar 20 '25 13:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Apr 19 '25 13:04 k8s-triage-robot

/remove-lifecycle rotten

May 06 '25 19:05 ArangoGutierrez

@ozhuraki, can you rebase so we can revive this initiative?

May 06 '25 19:05 ArangoGutierrez

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Aug 04 '25 20:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Sep 04 '25 07:09 k8s-triage-robot

/remove-lifecycle rotten

Sep 04 '25 07:09 ffromani

node-feature-discovery node-feature-discovery copied to clipboard

nfd-topology-updater: Detect E/P cores and expose through attributes

✅ Deploy Preview for kubernetes-sigs-nfd ready!

node-feature-discovery
node-feature-discovery copied to clipboard

Deploy Preview for kubernetes-sigs-nfd ready!