tuned
tuned copied to clipboard
Uncore perf
Hello maintainers,
This pull request for uncore power optimization on Intel servers. Please provide feedback.
I just tested this patch on a single socket Sapphire Rapids system, running a 6.3.0-0.rc4.35.eln126.x86_64 kernel.
I did see power savings at idle, (12% when C1 was not pinned, and 8% when C1 was pinned).
Linpack's performance did drop 1.6%. I did no other testing.
While I agree this would be good to get into TuneD, I'm not sure doing it with a new TuneD profile is the way to go. The profile which is part of this patchset is throughput-performance. That means if someone wanted to use this feature with another profile, they wouldn't be able to.
Question back to Jaroslav and the TuneD maintainers: Wouldn't it be better to create configuration files so that power saving tuning knobs like this one could be turned on/off in any TuneD profile?
To add a clarification to my earlier comment, there was no power savings when the cpus were busy. Those 8% and 12% power savings were only for idle systems.
To add a clarification to my earlier comment, there was no power savings when the cpus were busy. Those 8% and 12% power savings were only for idle systems.
When CPUs are 100% busy and accessing lots of memory all the time, then saving will not be seen. But workloads has idle times and also don't access uncore constantly to keep uncore high. So overall this is good knob. It is fine it is deployed using some other method than suggested here.
Wouldn't it be better to create configuration files so that power saving tuning knobs like this one could be turned on/off in any TuneD profile?
Absolutely. I'd personally like to avoid adding yet another profile. The "knobs" that could make use of the functionality in any TuneD profile would be nice though.
Wouldn't it be better to create configuration files so that power saving tuning knobs like this one could be turned on/off in any TuneD profile?
Absolutely. I'd personally like to avoid adding yet another profile. The "knobs" that could make use of the functionality in any TuneD profile would be nice though.
Do you mean that add a knob to: tuned-main.conf ?
Absolutely. I'd personally like to avoid adding yet another profile. The "knobs" that could make use of the functionality in any TuneD profile would be nice though.
Do you mean that add a knob to: tuned-main.conf ?
Personally, that's not what I had in mind. tuned-main.conf is a global configuration file and these "knobs" probably should not go there. I did not thought this through completely yet, but I was thinking to have the "knobs" in a new configuration file under /etc/tuned/something.conf. Please see /etc/tuned/cpu-partitioning-variables.conf for example. @yarda, do you think that would make more sense or do you have a better suggestion?
First, we probably need to agree what "knobs"/variables/tunables we need alongside with their names (just uncore_max_delta_mhz ?) and take it from there.
I did not thought this through completely yet, but I was thinking to have the "knobs" in a new configuration file under
/etc/tuned/something.conf. Please see/etc/tuned/cpu-partitioning-variables.conffor example. @yarda, do you think that would make more sense or do you have a better suggestion?
Hi Jiri, Jaroslav, and Srinivas: Jiri's comment is similar to what I was thinking as well.
For example, we could:
- Have a /etc/tuned/power-saving-variables.conf file.
- That file would contain a list of all powersaving knobs that TuneD supports, initially all commented out.
- Users could uncomment and set them as desired.
- This power-saving.conf file could be included in all the RHEL profiles. It would only cause power-savings if a user uncommented and set any of the knobs in that file.
I would defer to Jaroslav on his preferred approach for doing something like this.
I did not thought this through completely yet, but I was thinking to have the "knobs" in a new configuration file under
/etc/tuned/something.conf. Please see/etc/tuned/cpu-partitioning-variables.conffor example. @yarda, do you think that would make more sense or do you have a better suggestion?Hi Jiri, Jaroslav, and Srinivas: Jiri's comment is similar to what I was thinking as well.
For example, we could:
- Have a /etc/tuned/power-saving-variables.conf file.
- That file would contain a list of all powersaving knobs that TuneD supports, initially all commented out.
- Users could uncomment and set them as desired.
- This power-saving.conf file could be included in all the RHEL profiles. It would only cause power-savings if a user uncommented and set any of the knobs in that file.
I would defer to Jaroslav on his preferred approach for doing something like this.
Hi Jiri, Jaroslav, Joe
We can resubmit as suggested by Joe. Please let us know.
I did not thought this through completely yet, but I was thinking to have the "knobs" in a new configuration file under
/etc/tuned/something.conf. Please see/etc/tuned/cpu-partitioning-variables.conffor example. @yarda, do you think that would make more sense or do you have a better suggestion?Hi Jiri, Jaroslav, and Srinivas: Jiri's comment is similar to what I was thinking as well. For example, we could:
- Have a /etc/tuned/power-saving-variables.conf file.
- That file would contain a list of all powersaving knobs that TuneD supports, initially all commented out.
- Users could uncomment and set them as desired.
- This power-saving.conf file could be included in all the RHEL profiles. It would only cause power-savings if a user uncommented and set any of the knobs in that file.
I would defer to Jaroslav on his preferred approach for doing something like this.
Hi Jiri, Jaroslav, Joe
We can resubmit as suggested by Joe. Please let us know.
Hi Joe,
I created /etc/tuned/power-saving-variables.conf with contents. I added some example values for test
# governor=performance energy_perf_bias=performance # min_perf_pct=100 # max_perf_pct=100 # energy_performance_preference=balance_performance # force_latency=cstate.name:C6|cstate.id:4|10 # pm_qos_resume_latency_us
///////////// Suppose you include this in profiles/balanced/tuned.conf
[main] summary=General non-specialized tuned profile
[variables] include=/etc/tuned/power-saving-variables.conf
[modules] cpufreq_conservative=+r
[cpu] priority=10 governor=conservative|powersave
# energy_perf_bias=normal "This is what balanced defines" # but user has defined energy_perf_bias=performance # /etc/tuned/power-saving-variables.conf
energy_perf_bias=${energy_perf_bias}
There is no way to say that if ${energy_perf_bias} is defined use the value else use energy_perf_bias=normal
So what you are suggesting, may not be possible with current tuned?
Does this rely on the intel_uncore kmod?
Will this allow similar functionality in tuned like we've been able to use in the past via the msr-tools package, specifically the binaries rdmsr and wrmsr user space applications? In some cases the uncore frequency is set in server platform BIOS and has no knobs to control / modify without modifying registers, which isn't user friendly.
For the configurations we expect users usually to customize, the variables are way to go. If users are not expected to usually customize the configuration then new profile is probably a better solution.
In this specific case I would also prefer variables, i.e. the .conf file.
There is no way to say that if ${energy_perf_bias} is defined use the value else use energy_perf_bias=normal
IMHO this should be possible to implement with the ${f:regex_search_ternary} built in function, I could prepare some proof of concept. Something semantically similar is already used in the realtime and cpu-partitioning profiles.
Regarding the current implementation in this PR:
- from the kernel documentation it seems this tuning knob is per package die, thus I expect there could be multiple such knobs on the system accessible through subdirs, but If I am not mistaken in the current PR all are controlled in the first plugin instance. I think better would be to add support per device which would allow setting all the knobs individually, e.g.:
[cpus_group1]
devices=cpu1,cpu2,cpu3,...
type=cpu
uncore_max_delta_mhz= ${VALUE1}
[cpus_group2]
devices=cpu16,cpu17,cpu18,...
type=cpu
uncore_max_delta_mhz= ${VALUE2}
- Why is it called
uncore_max_delta_mhz? For no confusion wouldn't be better to use the same name and units as are used in the kernelsysfs? - I think verification should be also supported in the code for the
tuned-adm verifyto work.
Hi, I would like to move this PR forward.
IIUC we have two separate issues here, first is adding uncore frequency knob to cpu plugin, and second incorporate the knob into existing profiles (via variable.conf). For now I would like to concentrate on the first issue - a, once that done move to the second one.
- from the kernel documentation it seems this tuning knob is per package die, thus I expect there could be multiple such knobs on the system accessible through subdirs, but If I am not mistaken in the current PR all are controlled in the first plugin instance. I think better would be to add support per device which would allow setting all the knobs individually, e.g.:
[cpus_group1] devices=cpu1,cpu2,cpu3,... type=cpu uncore_max_delta_mhz= ${VALUE1} [cpus_group2] devices=cpu16,cpu17,cpu18,... type=cpu uncore_max_delta_mhz= ${VALUE2}
It's of cource reasonable to configure this per cpu. The one problem I see is that the all cpu's in the config might not necessary located in the same die. Basically will be required that devices='cpu list' is configured to die. That's ok I think , we can log error and make verify fail , if the configuration is not correct.
- Why is it called
uncore_max_delta_mhz? For no confusion wouldn't be better to use the same name and units as are used in the kernelsysfs?
Yes, it would be better to have the same units obviously. Regarding name it's delta, because we don't know apriori what is the maximum frequency. We read that value from sysfs intial_max_freq_khz and subtract the delta and write to max_freq_khz.
Does this rely on the intel_uncore kmod?
Yes.
Will this allow similar functionality in tuned like we've been able to use in the past via the msr-tools package, specifically the binaries
rdmsrandwrmsruser space applications? In some cases the uncore frequency is set in server platform BIOS and has no knobs to control / modify without modifying registers, which isn't user friendly.
Yes, I think using sysfs intel_uncore_frequency would be preferred to configure uncore frequency over rdmsr/wrmsr.
- from the kernel documentation it seems this tuning knob is per package die, thus I expect there could be multiple such knobs on the system accessible through subdirs, but If I am not mistaken in the current PR all are controlled in the first plugin instance. I think better would be to add support per device which would allow setting all the knobs individually, e.g.:
[cpus_group1] devices=cpu1,cpu2,cpu3,... type=cpu uncore_max_delta_mhz= ${VALUE1} [cpus_group2] devices=cpu16,cpu17,cpu18,... type=cpu uncore_max_delta_mhz= ${VALUE2}It's of cource reasonable to configure this per cpu. The one problem I see is that the all cpu's in the config might not necessary located in the same die. Basically will be required that devices='cpu list' is configured to die. That's ok I think , we can log error and make verify fail , if the configuration is not correct.
According to the kernel documentation (https://docs.kernel.org/admin-guide/pm/intel_uncore_frequency_scaling.html) this is configured per package and die combination - not per CPU. So I don't see how or why we would try to configure this for specific CPUs in tuned.
- Why is it called
uncore_max_delta_mhz? For no confusion wouldn't be better to use the same name and units as are used in the kernelsysfs?Yes, it would be better to have the same units obviously. Regarding name it's delta, because we don't know apriori what is the maximum frequency. We read that value from sysfs intial_max_freq_khz and subtract the delta and write to max_freq_khz.
I believe that in at least some cases, either Intel or the HW vendor is going to require a specific uncore frequency to be set. I think tuned needs to allow the uncore max_freq_khz and min_freq_khz to be set specifically - not using deltas. Or perhaps we should allow either method to be used?
According to the kernel documentation (https://docs.kernel.org/admin-guide/pm/intel_uncore_frequency_scaling.html) this is configured per package and die combination - not per CPU. So I don't see how or why we would try to configure this for specific CPUs in tuned.
We know about topology, the die_id can be read from sysfs i.e for cpu6 is: /sys/devices/system/cpu/cpu6/topology/die_id
However there are uncore's that might not contain cpu's (there are uncore* entries additionally to package_die entries in . /sys/devices/system/cpu/intel_uncore_frequency). So this is not that simple. One global variable that control all uncore's as provided this PR is the simplest solution.
I believe that in at least some cases, either Intel or the HW vendor is going to require a specific uncore frequency to be set. I think tuned needs to allow the uncore max_freq_khz and min_freq_khz to be set specifically - not using deltas. Or perhaps we should allow either method to be used?
I'm open to suggestions here. It can be direct value or percentage of max for example. However direct value would not be portable from system to system, hence I think both configuration methods indeed should be provided (direct and percentage/delta)
I've opened another PR . Please check it out to see if it goes to the right direction. It allows to define uncore freq per cpu (the config is checked against topology) . Hope it's something reasonable. Otherwise what I think could be done is one global option (as in this PR) or separate intel_uncore plugin where device is defined as entry in /sys/devices/system/cpu/intel_uncore_frequency/* i.e. package_NN_die_MM or uncoreNN .
Uncore support has been merged, if you want the setting in the throughput-performance profile, please update this PR or create new one.