Uncore perf

Open spandruvada opened this issue 2 years ago • 17 comments

Hello maintainers,

This pull request is for uncore power optimization on Intel servers. Please provide feedback.

spandruvada avatar Mar 22 '23 15:03 spandruvada

I just tested this patch on a single socket Sapphire Rapids system, running a 6.3.0-0.rc4.35.eln126.x86_64 kernel.

I did see power savings at idle (12% when C1 was not pinned, and 8% when C1 was pinned).
Linpack performance dropped by 1.6%. I did no other testing.

While I agree this would be good to get into TuneD, I'm not sure doing it with a new TuneD profile is the way to go. The profile which is part of this patchset is throughput-performance. That means if someone wanted to use this feature with another profile, they wouldn't be able to.

Question back to Jaroslav and the TuneD maintainers: Wouldn't it be better to create configuration files so that power saving tuning knobs like this one could be turned on/off in any TuneD profile?

joemario avatar Mar 28 '23 18:03 joemario

To add a clarification to my earlier comment, there was no power savings when the cpus were busy. Those 8% and 12% power savings were only for idle systems.

joemario avatar Mar 28 '23 18:03 joemario

To add a clarification to my earlier comment, there was no power savings when the cpus were busy. Those 8% and 12% power savings were only for idle systems.

When CPUs are 100% busy and accessing lots of memory all the time, no savings will be seen. But workloads have idle times and also don't access the uncore constantly enough to keep the uncore frequency high. So overall this is a good knob. It is fine if it is deployed using some method other than the one suggested here.

spandruvada avatar Apr 04 '23 17:04 spandruvada

Wouldn't it be better to create configuration files so that power saving tuning knobs like this one could be turned on/off in any TuneD profile?

Absolutely. I'd personally like to avoid adding yet another profile. The "knobs" that could make use of the functionality in any TuneD profile would be nice though.

jmencak avatar Apr 06 '23 10:04 jmencak

Wouldn't it be better to create configuration files so that power saving tuning knobs like this one could be turned on/off in any TuneD profile?

Absolutely. I'd personally like to avoid adding yet another profile. The "knobs" that could make use of the functionality in any TuneD profile would be nice though.

Do you mean adding a knob to tuned-main.conf?

spandruvada avatar Apr 07 '23 20:04 spandruvada

Absolutely. I'd personally like to avoid adding yet another profile. The "knobs" that could make use of the functionality in any TuneD profile would be nice though.

Do you mean adding a knob to tuned-main.conf?

Personally, that's not what I had in mind. tuned-main.conf is a global configuration file and these "knobs" probably should not go there. I have not thought this through completely yet, but I was thinking of having the "knobs" in a new configuration file under /etc/tuned/something.conf. Please see /etc/tuned/cpu-partitioning-variables.conf for an example. @yarda, do you think that would make more sense or do you have a better suggestion?

First, we probably need to agree on which "knobs"/variables/tunables we need, along with their names (just uncore_max_delta_mhz?), and take it from there.

jmencak avatar Apr 08 '23 16:04 jmencak

I have not thought this through completely yet, but I was thinking of having the "knobs" in a new configuration file under /etc/tuned/something.conf. Please see /etc/tuned/cpu-partitioning-variables.conf for an example. @yarda, do you think that would make more sense or do you have a better suggestion?

Hi Jiri, Jaroslav, and Srinivas: Jiri's comment is similar to what I was thinking as well.

For example, we could:

  1. Have a /etc/tuned/power-saving-variables.conf file.
  2. That file would contain a list of all powersaving knobs that TuneD supports, initially all commented out.
  3. Users could uncomment and set them as desired.
  4. This power-saving-variables.conf file could be included in all the RHEL profiles. It would only cause power savings if a user uncommented and set any of the knobs in that file.

I would defer to Jaroslav on his preferred approach for doing something like this.
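
To make the "commented out by default" idea concrete, here is a minimal sketch of how such a variables file could be read so that only uncommented knobs take effect. It is an illustration only, not TuneD code: the function name parse_variables and the sample knob values are assumptions.

```python
from typing import Dict


def parse_variables(text: str) -> Dict[str, str]:
    """Return only the knobs that are uncommented and have a value.

    Hypothetical helper for illustration; TuneD has its own config parser.
    """
    knobs: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and commented-out knobs are ignored entirely.
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        knobs[name.strip()] = value.strip()
    return knobs


# Sample file content (assumed values): every knob but one is commented out.
EXAMPLE = """\
# governor=performance
energy_perf_bias=performance
# min_perf_pct=100
"""

if __name__ == "__main__":
    print(parse_variables(EXAMPLE))
```

With a file where everything is still commented out, the result is an empty dictionary, which corresponds to the "no power saving unless the user opts in" behavior described in the list.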

joemario avatar Apr 08 '23 17:04 joemario

Hi Jiri, Jaroslav, Joe

We can resubmit as suggested by Joe. Please let us know.

spandruvada avatar Apr 10 '23 19:04 spandruvada

Hi Joe,

I created /etc/tuned/power-saving-variables.conf with the following contents. I added some example values for testing:

# governor=performance
energy_perf_bias=performance
# min_perf_pct=100
# max_perf_pct=100
# energy_performance_preference=balance_performance
# force_latency=cstate.name:C6|cstate.id:4|10
# pm_qos_resume_latency_us

Suppose you include this in profiles/balanced/tuned.conf:

[main]
summary=General non-specialized tuned profile

[variables]
include=/etc/tuned/power-saving-variables.conf

[modules]
cpufreq_conservative=+r

[cpu]
priority=10
governor=conservative|powersave
# energy_perf_bias=normal "This is what balanced defines"
# but user has defined energy_perf_bias=performance
# /etc/tuned/power-saving-variables.conf
energy_perf_bias=${energy_perf_bias}

There is no way to say "if ${energy_perf_bias} is defined, use its value; otherwise use energy_perf_bias=normal".

So what you are suggesting may not be possible with current TuneD?

spandruvada avatar Apr 19 '23 23:04 spandruvada

Does this rely on the intel_uncore kmod?

Will this allow similar functionality in tuned like we've been able to use in the past via the msr-tools package, specifically the binaries rdmsr and wrmsr user space applications? In some cases the uncore frequency is set in server platform BIOS and has no knobs to control / modify without modifying registers, which isn't user friendly.

novacain1 avatar Jun 12 '23 18:06 novacain1

For configuration that we expect users to customize regularly, variables are the way to go. If users are not usually expected to customize the configuration, then a new profile is probably a better solution.

In this specific case I would also prefer variables, i.e. the .conf file.

There is no way to say "if ${energy_perf_bias} is defined, use its value; otherwise use energy_perf_bias=normal".

IMHO this should be possible to implement with the ${f:regex_search_ternary} built-in function. I could prepare a proof of concept. Something semantically similar is already used in the realtime and cpu-partitioning profiles.
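
As a rough illustration of the "use the user's value if defined, else the profile default" semantics, here is a small Python sketch. It only mimics a ternary-on-regex-match helper. The argument order and the with_default wrapper are assumptions made for this example, and it does not show how TuneD itself expands an undefined variable.

```python
import re


def regex_search_ternary(value: str, pattern: str, if_match: str, if_no_match: str) -> str:
    """Return if_match when pattern is found in value, else if_no_match.

    Stand-in written for this example; not the TuneD built-in itself.
    """
    return if_match if re.search(pattern, value) else if_no_match


def with_default(user_value: str, default: str) -> str:
    """Prefer a non-empty user value; otherwise fall back to the profile default."""
    # r"\S" matches as soon as the value contains any non-whitespace character.
    return regex_search_ternary(user_value, r"\S", user_value, default)


if __name__ == "__main__":
    print(with_default("performance", "normal"))  # user set the knob
    print(with_default("", "normal"))             # knob left unset
```

The design point is that the profile keeps its own default (normal in the balanced example) and the variables file only overrides it when the user has actually set something.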

Regarding the current implementation in this PR:

  • From the kernel documentation it seems this tuning knob is per package die, so I expect there could be multiple such knobs on the system, accessible through subdirectories. If I am not mistaken, though, the current PR controls all of them from the first plugin instance. I think it would be better to add per-device support, which would allow setting all the knobs individually, e.g.:
[cpus_group1]
devices=cpu1,cpu2,cpu3,...
type=cpu
uncore_max_delta_mhz= ${VALUE1}

[cpus_group2]
devices=cpu16,cpu17,cpu18,...
type=cpu
uncore_max_delta_mhz= ${VALUE2}
  • Why is it called uncore_max_delta_mhz? To avoid confusion, wouldn't it be better to use the same name and units as the kernel sysfs uses?
  • I think verification should be also supported in the code for the tuned-adm verify to work.

yarda avatar Jun 22 '23 10:06 yarda

Hi, I would like to move this PR forward.

IIUC we have two separate issues here: the first is adding the uncore frequency knob to the cpu plugin, and the second is incorporating the knob into existing profiles (via a variables .conf file). For now I would like to concentrate on the first issue and, once that is done, move on to the second one.

  • From the kernel documentation it seems this tuning knob is per package die, so I expect there could be multiple such knobs on the system, accessible through subdirectories. If I am not mistaken, though, the current PR controls all of them from the first plugin instance. I think it would be better to add per-device support, which would allow setting all the knobs individually, e.g.:
[cpus_group1]
devices=cpu1,cpu2,cpu3,...
type=cpu
uncore_max_delta_mhz= ${VALUE1}

[cpus_group2]
devices=cpu16,cpu17,cpu18,...
type=cpu
uncore_max_delta_mhz= ${VALUE2}

It is of course reasonable to configure this per CPU. The one problem I see is that the CPUs listed in a config section might not all be located on the same die. Basically, the devices='cpu list' setting will be required to match a die. I think that is OK: we can log an error and make verify fail if the configuration is not correct.
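
A minimal sketch of that consistency check, assuming the standard sysfs topology files physical_package_id and die_id are readable for every listed CPU. The function names and the root parameter (there so the check can run against a fake tree) are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Set, Tuple

SYSFS_CPU = Path("/sys/devices/system/cpu")


def cpu_die(cpu: int, root: Path = SYSFS_CPU) -> Tuple[int, int]:
    """Return (package id, die id) for one CPU, read from its topology directory."""
    topo = root / f"cpu{cpu}" / "topology"
    package = int((topo / "physical_package_id").read_text())
    die = int((topo / "die_id").read_text())
    return package, die


def dies_for(cpus: Iterable[int], root: Path = SYSFS_CPU) -> Set[Tuple[int, int]]:
    """Collect the distinct (package, die) pairs covered by a CPU list."""
    return {cpu_die(cpu, root) for cpu in cpus}


def is_single_die(cpus: Iterable[int], root: Path = SYSFS_CPU) -> bool:
    """True when every listed CPU sits on the same package/die combination."""
    return len(dies_for(cpus, root)) == 1
```

A plugin could call is_single_die on the devices list of each section and log an error, or fail verification, when it returns False.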

  • Why is it called uncore_max_delta_mhz? To avoid confusion, wouldn't it be better to use the same name and units as the kernel sysfs uses?

Yes, it would obviously be better to use the same units. Regarding the name, it is a delta because we do not know a priori what the maximum frequency is. We read that value from the sysfs attribute initial_max_freq_khz, subtract the delta, and write the result to max_freq_khz.
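
The delta computation can be sketched as follows. This is an illustration under assumptions, not the code from the PR: the function name apply_max_delta is hypothetical, the directory is passed in so the logic can be tried on a scratch directory, and no clamping against the minimum frequency is attempted.

```python
from pathlib import Path


def apply_max_delta(uncore_dir: Path, delta_mhz: int) -> int:
    """Lower the uncore maximum by delta_mhz relative to its initial value.

    uncore_dir is one entry of the intel_uncore_frequency sysfs directory.
    Returns the value (in kHz) that was written to max_freq_khz.
    """
    initial_khz = int((uncore_dir / "initial_max_freq_khz").read_text())
    # The knob is given in MHz while sysfs uses kHz.
    target_khz = initial_khz - delta_mhz * 1000
    if target_khz <= 0:
        raise ValueError(f"delta of {delta_mhz} MHz exceeds initial maximum {initial_khz} kHz")
    (uncore_dir / "max_freq_khz").write_text(str(target_khz))
    return target_khz
```

Because the target is always derived from initial_max_freq_khz rather than the current max_freq_khz, applying the same delta repeatedly does not keep lowering the limit.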

sgruszka avatar Nov 28 '23 10:11 sgruszka

Does this rely on the intel_uncore kmod?

Yes.

Will this allow similar functionality in tuned like we've been able to use in the past via the msr-tools package, specifically the binaries rdmsr and wrmsr user space applications? In some cases the uncore frequency is set in server platform BIOS and has no knobs to control / modify without modifying registers, which isn't user friendly.

Yes, I think using the intel_uncore_frequency sysfs interface would be preferred over rdmsr/wrmsr for configuring the uncore frequency.

sgruszka avatar Nov 28 '23 10:11 sgruszka

It is of course reasonable to configure this per CPU. The one problem I see is that the CPUs listed in a config section might not all be located on the same die. Basically, the devices='cpu list' setting will be required to match a die. I think that is OK: we can log an error and make verify fail if the configuration is not correct.

According to the kernel documentation (https://docs.kernel.org/admin-guide/pm/intel_uncore_frequency_scaling.html) this is configured per package and die combination - not per CPU. So I don't see how or why we would try to configure this for specific CPUs in tuned.

  • Why is it called uncore_max_delta_mhz? To avoid confusion, wouldn't it be better to use the same name and units as the kernel sysfs uses?

Yes, it would obviously be better to use the same units. Regarding the name, it is a delta because we do not know a priori what the maximum frequency is. We read that value from the sysfs attribute initial_max_freq_khz, subtract the delta, and write the result to max_freq_khz.

I believe that in at least some cases, either Intel or the HW vendor is going to require a specific uncore frequency to be set. I think tuned needs to allow the uncore max_freq_khz and min_freq_khz to be set specifically - not using deltas. Or perhaps we should allow either method to be used?

bartwensley avatar Nov 28 '23 14:11 bartwensley

According to the kernel documentation (https://docs.kernel.org/admin-guide/pm/intel_uncore_frequency_scaling.html) this is configured per package and die combination - not per CPU. So I don't see how or why we would try to configure this for specific CPUs in tuned.

We know the topology: the die_id can be read from sysfs, e.g. for cpu6 it is /sys/devices/system/cpu/cpu6/topology/die_id

However, there are uncores that might not contain CPUs (there are uncore* entries in addition to the package_die entries in /sys/devices/system/cpu/intel_uncore_frequency), so this is not that simple. One global variable that controls all uncores, as provided by this PR, is the simplest solution.

I believe that in at least some cases, either Intel or the HW vendor is going to require a specific uncore frequency to be set. I think tuned needs to allow the uncore max_freq_khz and min_freq_khz to be set specifically - not using deltas. Or perhaps we should allow either method to be used?

I'm open to suggestions here. It can be a direct value or a percentage of the maximum, for example. However, a direct value would not be portable from system to system, so I think both configuration methods should indeed be provided (direct and percentage/delta).
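
One way both methods could coexist is a single option that accepts either form. The sketch below is only an assumption about what that might look like: the "80%" syntax and the function name resolve_max_khz are invented for this example and are not an agreed TuneD option format.

```python
def resolve_max_khz(spec: str, initial_max_khz: int) -> int:
    """Turn a user setting into an absolute max frequency in kHz.

    Accepted forms (hypothetical syntax):
      "2000000" -> absolute value in kHz, used as-is
      "80%"     -> percentage of the initial maximum
    """
    spec = spec.strip()
    if spec.endswith("%"):
        percent = int(spec[:-1])
        if not 0 < percent <= 100:
            raise ValueError(f"percentage out of range: {spec}")
        # Integer arithmetic keeps the result an exact kHz value.
        return initial_max_khz * percent // 100
    return int(spec)
```

The percentage form stays portable across systems with different initial maxima, while the absolute form covers the case where a vendor mandates a specific frequency.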

sgruszka avatar Nov 28 '23 15:11 sgruszka

I've opened another PR. Please check it out to see whether it goes in the right direction. It allows defining the uncore frequency per CPU (the config is checked against the topology). I hope it is something reasonable. Otherwise, what I think could be done is one global option (as in this PR) or a separate intel_uncore plugin where a device is defined as an entry in /sys/devices/system/cpu/intel_uncore_frequency/*, i.e. package_NN_die_MM or uncoreNN.
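
For the separate-plugin alternative, device discovery could look roughly like the sketch below, which sorts the directory entries into the two naming schemes mentioned above. The function name and the returned dictionary layout are assumptions for illustration.

```python
import re
from pathlib import Path
from typing import Dict, List

UNCORE_ROOT = Path("/sys/devices/system/cpu/intel_uncore_frequency")

# The two entry naming schemes: package_NN_die_MM and uncoreNN.
PACKAGE_DIE = re.compile(r"^package_(\d+)_die_(\d+)$")
UNCORE = re.compile(r"^uncore(\d+)$")


def list_uncore_devices(root: Path = UNCORE_ROOT) -> Dict[str, List[str]]:
    """Group the uncore sysfs entries by naming scheme; other entries are skipped."""
    devices: Dict[str, List[str]] = {"package_die": [], "uncore": []}
    if not root.is_dir():
        # Driver not loaded or not an Intel system: nothing to manage.
        return devices
    for name in sorted(p.name for p in root.iterdir() if p.is_dir()):
        if PACKAGE_DIE.match(name):
            devices["package_die"].append(name)
        elif UNCORE.match(name):
            devices["uncore"].append(name)
    return devices
```

Each returned name could then serve as a device for such a plugin, independent of which CPUs (if any) belong to that uncore.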

sgruszka avatar Dec 04 '23 13:12 sgruszka

Uncore support has been merged. If you want the setting in the throughput-performance profile, please update this PR or create a new one.

yarda avatar Jul 25 '24 11:07 yarda