node-feature-discovery icon indicating copy to clipboard operation
node-feature-discovery copied to clipboard

nfd-topology-updater: Detect E/P cores and expose through attributes

Open ozhuraki opened this issue 1 year ago • 24 comments
trafficstars

Detect which CPUs are which types of the cores (P-cores or E-cores) and expose IDs through labels.

ozhuraki avatar Jun 07 '24 09:06 ozhuraki

Hi @ozhuraki. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 07 '24 09:06 k8s-ci-robot

Deploy Preview for kubernetes-sigs-nfd ready!

Name Link
Latest commit 2946157c6f7d8903307da737bc5a6c48172891be
Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-nfd/deploys/67599acefb726600082bb9a9
Deploy Preview https://deploy-preview-1737--kubernetes-sigs-nfd.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Jun 07 '24 09:06 netlify[bot]

I have several concerns regarding this PR:

  1. way of detecting cores. We should use /sys/devices/cpu_*/*cpus for getting information about P/E cores. not something else.
  2. Those subdirs cpu_* are present only on hybrid CPUs, so if we have only one subdir /sys/devices/cpu/, it means only one type of the cores exists in the package, thus no need to expose those attributes at all. In other words: labels should be exposed only if CPU is hybrid.
  3. Even on hybrid CPUs, those labels should not be by default populated, only when configuration requests that. I agree with Markus that those labels are not directly consumable, so extra step to enable them on hybrid CPUs is fine.
  4. Exposing it as attributes in topology exporter is fine. There we can put them without configuration option if running on hybrid CPUs.
  5. Names of the types of the cores should not be hard-coded, but taken from sysfs. e.g.
$ grep "" /sys/devices/cpu_*/*cpus
/sys/devices/cpu_atom/cpus:16-19
/sys/devices/cpu_core/cpus:0-15
$

so, the labels should have "foo..atom" and "foo...core" names. @marquiz please suggest some good prefix instead of "foo..."

kad avatar Jul 12 '24 07:07 kad

@marquiz @kad

Thanks for the helpful input. Updated, please take a look

ozhuraki avatar Aug 19 '24 11:08 ozhuraki

@marquiz

Thanks for the useful input! Updated, please take a look

ozhuraki avatar Nov 04 '24 18:11 ozhuraki

@marquiz @kad

Thanks for the help! Updated, please take a look

ozhuraki avatar Nov 06 '24 15:11 ozhuraki

@marquiz

Thanks, updated, please take a look

ozhuraki avatar Nov 08 '24 14:11 ozhuraki

@marquiz

Thanks, updated, please take a look

ozhuraki avatar Nov 11 '24 12:11 ozhuraki

/retest

marquiz avatar Nov 28 '24 13:11 marquiz

One question and sorry if I misunderstand how hybrid cpus can be consumed in K8s. Wouldn't be more useful to just modify ResourceAggregator to advertise new resources for each cpu type in each Zone (NUMA node) by mapping generic k8s cpu resource to each cpu_type? But I may be wrong as I don't know how currently those E/P cores are assigned and consumed in Kubernetes.

PiotrProkop avatar Nov 28 '24 14:11 PiotrProkop

@PiotrProkop

There's nothing special about it, just one way of exposing such information.

We can also modify ResourceAggregator to advertise new resources for each cpu type too.

I will make an issue and will try to add it with a sepatate PR.

ozhuraki avatar Nov 29 '24 10:11 ozhuraki

I just don't know how this information provided via attributes can be useful to custom scheduler without mapping to already used cpus. NRT object only exposes info about allocated/free cpus in given Zone without an info which cpu id is used and which is free.

PiotrProkop avatar Nov 29 '24 14:11 PiotrProkop

@PiotrProkop

Thanks and sorry, missed you comment.

Yes, it's merely exposing an information for particuar kind of cores for workloads that just want particular kind of cores on hybrid cpus.

I made an issue #1964 to expose it through ResourceAggregator (so the custom scheduler could track allocated/free cores) and will try to add such PR too.

@uniemimu

Thanks, updated.

ozhuraki avatar Dec 11 '24 11:12 ozhuraki

I still feel like this information should be then exposed via labels and just say that given node has e or p cores without the numbers. And then no special scheduler plugin is needed but I may be wrong on what's the usecase here. I'll ask @ffromani to chime in.

PiotrProkop avatar Dec 11 '24 11:12 PiotrProkop

@PiotrProkop

We initially had it through labels but later switched to annotations since labels were restrictive on exposing IDs.

I can additionaly expose via labels and just say that given node has E or P, as you suggest.

ozhuraki avatar Dec 11 '24 11:12 ozhuraki

@PiotrProkop

Added labels too, please take a look.

ozhuraki avatar Dec 11 '24 14:12 ozhuraki

@ozhuraki: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-node-feature-discovery-build-image-cross-generic 2946157c6f7d8903307da737bc5a6c48172891be link true /test pull-node-feature-discovery-build-image-cross-generic

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot avatar Dec 11 '24 16:12 k8s-ci-robot

/cc

ffromani avatar Dec 16 '24 10:12 ffromani

got the ping, will review ASAP

ffromani avatar Dec 16 '24 10:12 ffromani

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marquiz, ozhuraki, uniemimu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Dec 19 '24 12:12 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 20 '25 13:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Apr 19 '25 13:04 k8s-triage-robot

/remove-lifecycle rotten

ArangoGutierrez avatar May 06 '25 19:05 ArangoGutierrez

@ozhuraki, can you rebase so we can revive this initiative?

ArangoGutierrez avatar May 06 '25 19:05 ArangoGutierrez

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 04 '25 20:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 04 '25 07:09 k8s-triage-robot

/remove-lifecycle rotten

ffromani avatar Sep 04 '25 07:09 ffromani