
cpu: add disable_thread_siblings option

Open adriaan42 opened this issue 3 years ago • 16 comments

This disables the thread siblings (a.k.a. hyper-threads) of the selected CPUs, which prevents any performance impact from resources shared between the sibling threads.

The selected CPUs will stay online. If the list contains CPUs and their sibling threads, then those siblings will also stay online.

Note that this interacts with the scheduler plugin, because it changes the set of available CPUs that may appear in affinity masks. When using this, [cpu] should appear after [scheduler] in the profile.
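For context, taking a CPU offline at runtime comes down to a sysfs write; a minimal sketch of the mechanism (illustrative, not this PR's implementation):

# Writing 0/1 to a CPU's sysfs "online" file hot-unplugs/replugs it.
# Note that cpu0 usually has no "online" file and cannot be offlined.
def set_cpu_online(cpu, online=True):
    with open("/sys/devices/system/cpu/cpu%d/online" % cpu, "w") as f:
        f.write("1" if online else "0")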

Signed-off-by: Adriaan Schmidt [email protected]

adriaan42 avatar Jul 07 '21 13:07 adriaan42

There are scenarios where disabling thread siblings (hyper-threads) is recommended, and we use it in some more aggressively-tuned use cases:

  • https://01.org/packet-processing/blogs/nsundar/2018/nfv-i-host-configuration-low-latency
  • https://rigtorp.se/low-latency-guide/

adriaan42 avatar Jul 07 '21 13:07 adriaan42

@adriaan42 could it wait for the 2.17.0 release? I would like to tag the 2.16.0 release today or tomorrow (we are quite behind schedule).

yarda avatar Jul 07 '21 18:07 yarda

@adriaan42 could it wait for the 2.17.0 release? I would like to tag the 2.16.0 release today or tomorrow (we are quite behind schedule).

Yes, sure. There's no hurry. I have a long weekend now, will post an update next week.

adriaan42 avatar Jul 07 '21 19:07 adriaan42

Thank you for the PR, @adriaan42 ! This is definitely a useful functionality to add to TuneD. Could you please provide some documentation on how to use this feature in the commit message/description of this PR, along with an example? From my own experience, adding documentation helps you think more about the feature you're adding and often improves the implementation itself too. Here is an example of what we're working on right now for some inspiration. Your docs don't have to be asciidoc, just plain text.

jmencak avatar Jul 21 '21 14:07 jmencak

@jmencak thanks! Yes, I will provide some documentation.

While thinking about the feature some more, a question about the implementation occurred to me: in my current proposal, all you pass is the CPUs whose siblings should be taken offline, so you could say disable_thread_siblings=${isolated_cores}, and at the profile level you never need to know which cores will actually be deactivated.
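Concretely, a profile using this v1 proposal would look something like this (the cpulist is made up for illustration):

[variables]
isolated_cores=2-5
[cpu]
disable_thread_siblings=${isolated_cores}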

But offlining CPUs affects other code that deals with affinities, and I see some interactions with the scheduler plugin, where the ordering of setting affinities vs. offlining CPUs matters.

An alternative would be to expose the thread sibling logic to the profile, e.g. providing something like a thread_siblings_of(cpu_set) function. This way the profiles could calculate CPU sets explicitly, and remove the deactivated cores from the affinity sets (currently represented by the profile variable non_isolated_cores). After the profile has calculated the set of cores to disable, it would then do something like disable_cpus=${cores_to_offline} (where disable_cpus no longer needs any thread sibling logic). This would mean more code, and some effort in the profiles using the feature, but may be the cleaner approach.
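In profile terms, the alternative would look roughly like this (option and function names as proposed above; the exact syntax is illustrative):

[variables]
cores_to_offline = ${f:thread_siblings_of:${isolated_cores}}
non_isolated_cores = ${f:cpulist_invert:${isolated_cores},${cores_to_offline}}
[cpu]
disable_cpus = ${cores_to_offline}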

What do you think?

adriaan42 avatar Jul 22 '21 11:07 adriaan42

While thinking about the feature some more, a question about the implementation occurred to me: in my current proposal, all you pass is the CPUs whose siblings should be taken offline, so you could say disable_thread_siblings=${isolated_cores}, and at the profile level you never need to know which cores will actually be deactivated.

Sounds good. Traditionally, ${isolated_cores} was used in the sense of non-housekeeping cores, i.e. the cores for latency-sensitive applications to run on. This PR will enable better granularity by selectively disabling HT for a subset of selected cores only, rather than using the kernel command-line option noht.

But offlining CPUs affects other code that deals with affinities, and I see some interactions with the scheduler plugin, where the ordering of setting affinities vs. offlining CPUs matters.

Aah, possibly.

An alternative would be to expose the thread sibling logic to the profile, e.g. providing something like a thread_siblings_of(cpu_set) function. This way the profiles could calculate CPU sets explicitly, and remove the deactivated cores from the affinity sets (currently represented by the profile variable non_isolated_cores). After the profile has calculated the set of cores to disable, it would then do something like disable_cpus=${cores_to_offline} (where disable_cpus no longer needs any thread sibling logic). This would mean more code, and some effort in the profiles using the feature, but may be the cleaner approach.

Yes, this sounds "cleaner" and probably even more flexible than the existing solution, so what you suggest sounds reasonable to me. Perhaps we need a proof-of-concept, a few examples (docs) and a bit of testing to answer whether this is really the way to go? Other folk (especially TuneD maintainers) might have an opinion on this already too?

jmencak avatar Jul 23 '21 11:07 jmencak

This pull request introduces 2 alerts when merging a77422c7fa5513a5fab00f2dc773d887197d77df into 7ba8d5d694267783466d1700ca4c1a09a5610aec - view on LGTM.com

new alerts:

  • 2 for Unused import

lgtm-com[bot] avatar Jul 26 '21 12:07 lgtm-com[bot]

I've just pushed version 2 of my proposal, which splits functionality into two parts:

  • the function cpulist_thread_siblings is used to explicitly calculate the CPUs that should be disabled
  • the option disable_cpus of the cpu plugin disables them (could rename this to disable_cores which may be a more consistent name...)

So, in a profile, we can now:

[variables]
isolated = <insert here>
siblings_of_isolated = ${f:cpulist_thread_siblings:${isolated}}
isolated_or_disabled = ${isolated},${siblings_of_isolated}
non_isolated = ${f:cpulist_invert:${isolated_or_disabled}}
[bootloader]
cmdline = isolcpus=${isolated} rcu_nocbs=${isolated} irqaffinity=${non_isolated}
# or even ${isolated_or_disabled} for isolcpus and rcu_nocbs, not sure which is better
[cpu]
disable_cpus = ${siblings_of_isolated}
[scheduler]
isolated_cores = ${isolated}

Or construct CPU sets differently if we want some core isolated with their siblings disabled, and some with siblings enabled, etc.
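For readers who want to see what cpulist_thread_siblings has to compute, a rough sketch based on the kernel's sysfs topology files (illustrative, not the PR's code):

def parse_cpulist(text):
    # expand a cpulist string like "0,2-3" into the set {0, 2, 3}
    cpus = set()
    for part in text.split(","):
        lo, _, hi = part.strip().partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def cpulist_thread_siblings(cpulist):
    # collect each CPU's SMT siblings from sysfs and return only
    # the siblings that were not in the input list themselves
    cpus = parse_cpulist(cpulist)
    siblings = set()
    for cpu in cpus:
        path = "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list" % cpu
        with open(path) as f:
            siblings |= parse_cpulist(f.read().strip())
    return ",".join(str(c) for c in sorted(siblings - cpus))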

adriaan42 avatar Jul 26 '21 12:07 adriaan42

Many thanks for the updated PR! I've tested the current version and it seems to work fine. Swapping the [cpu] and [scheduler] sections also worked fine. Even though disable_cores would be consistent with isolated_cores, I personally wouldn't rename disable_cpus to disable_cores. isolated_cores was an unfortunate past decision which should in fact have been isolated_cpus.

As for the code review, leaving this up to the TuneD maintainers.

jmencak avatar Jul 26 '21 14:07 jmencak

@yarda could you please have a look at the updated implementation?

adriaan42 avatar Sep 14 '21 08:09 adriaan42

@yarda , I believe it is now a good time to revisit this as 2.20 is out, WDYT? It's been some time... If you want, I can give it some testing from my side.

jmencak avatar Feb 21 '23 08:02 jmencak

I think we need to extend the ADOC documentation in the CPULatencyPlugin class to cover the new option.

I am afraid this PR breaks the current logic a bit. There can be multiple instances of the cpu plugin, thus I think such changes should be made only in the "first instance" of the cpu plugin. Moreover, disablement of the CPUs may affect other cpu plugin instances, thus reenumeration or hotplug/unplug event initiation will probably be required.

Maybe we could just add the "offline" or similar option to the cpu plugin? E.g.:

[cpus_online]
type=cpu
devices=${f:cpulist2devs:0-3}
...

[cpus_offline]
type=cpu
devices=${f:cpulist2devs:4-7}
offline=true

I think this would be more compatible with the current logic, and you could also add the function for calculation of the siblings (e.g. cpulist_thread_siblings(0-3) to get siblings of CPUs 0, 1, 2, 3) and disable them all in just the one cpu plugin instance, e.g. the one called [cpus_offline].

yarda avatar Mar 08 '23 19:03 yarda

I think we need to extend the ADOC documentation in the CPULatencyPlugin class to cover the new option.

I am afraid this PR breaks the current logic a bit. There can be multiple instances of the cpu plugin, thus I think such changes should be made only in the "first instance" of the cpu plugin. Moreover, disablement of the CPUs may affect other cpu plugin instances, thus reenumeration or hotplug/unplug event initiation will probably be required.

Actually, the current implementation already ensures the disabling only runs in the first instance (it happens after the if not instance._first_instance: return in _instance_apply_static). To reenumerate or trigger hotplug/unplug events I'd need some help...
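For reference, the guard mentioned here looks roughly like this (a sketch built around the quoted check; everything else, including the _disable_cpus helper, is hypothetical):

def _instance_apply_static(self, instance):
    # apply the regular per-instance settings first ...
    if not instance._first_instance:
        return
    # only the first cpu plugin instance reaches this point, so the
    # CPU disabling (like the latency settings) runs exactly once
    self._disable_cpus(instance)  # hypothetical helper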

Maybe we could just add the "offline" or similar option to the cpu plugin? E.g.:

[cpus_online]
type=cpu
devices=${f:cpulist2devs:0-3}
...

[cpus_offline]
type=cpu
devices=${f:cpulist2devs:4-7}
offline=true

I think this would be more compatible with the current logic, and you could also add the function for calculation of the siblings (e.g. cpulist_thread_siblings(0-3) to get siblings of CPUs 0, 1, 2, 3) and disable them all in just the one cpu plugin instance, e.g. the one called [cpus_offline].

Yes, I've just played around with it, and it can be implemented like this. Just one thing that makes this tricky to use: the cpus_online instance needs to come first, because things like latency settings are only applied for the first instance. That means we'd need to explicitly calculate the online cpuset. We can't just define the CPUs we want to offline in cpus_offline and let the cpus_online instance "catch" the remaining devices. But that's ok for me.

@yarda just let me know which way you'd like the implementation, then I can send an update (with documentation).

adriaan42 avatar Apr 13 '23 17:04 adriaan42

@adriaan42 I like the 'offline' option more. It's more the TuneD way.

It's also possible to explicitly define the order of the instances by using the 'priority' option; otherwise they are processed in the order they are defined, e.g.:

[cpu_offline]
type=cpu
devices=....
priority=10

[cpu_online]
type=cpu
devices=...
priority=5

And cpu_online will always be processed first.

There is now also the 'pm_qos_resume_latency_us' option, which works per CPU, e.g.:

[cpu_online]
type=cpu
devices=...
pm_qos_resume_latency_us=100

[cpu_offline]
type=cpu
devices=....
pm_qos_resume_latency_us=150

yarda avatar Apr 13 '23 21:04 yarda

@yarda I've just pushed an update. In the code, I'm doing a whole dance with _storage_get and _storage_set to track which CPUs need to be re-enabled. Is this actually the correct way to do it?
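For readers following along, the save/restore pattern being described is roughly this (the _storage_get/_storage_set names come from the comment above; their exact signatures here are assumptions, not TuneD's actual API):

def _set_cpu_online(cpu, online):
    # illustrative sysfs write; cpu0 typically cannot be offlined
    with open("/sys/devices/system/cpu/cpu%d/online" % cpu, "w") as f:
        f.write("1" if online else "0")

def _apply_offline(self, instance, cpus):
    # remember exactly which CPUs we took offline ...
    self._storage_set(instance, "offlined_cpus", list(cpus))
    for cpu in cpus:
        _set_cpu_online(cpu, False)

def _unapply_offline(self, instance):
    # ... so that rollback re-enables only those
    for cpu in self._storage_get(instance, "offlined_cpus") or []:
        _set_cpu_online(cpu, True)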

adriaan42 avatar May 04 '23 13:05 adriaan42

Rebased on current master, and also fixed a bug in _instance_verify_static.

@yarda I'm still seeing a potential usability issue due to the interaction between "first instance" and "implicit assignment of devices":

If I have a system with cpus 0-3, and want to take cpu3 offline, I'd want to:

[cpu_online]
type=cpu
[cpu_offline]
type=cpu
devices=${f:cpulist2devs:3}
offline=true

But [cpu_online] will now claim all devices:

2023-09-28 09:24:56,875 DEBUG    tuned.units.manager: creating 'cpu_online' (cpu)
2023-09-28 09:24:56,875 DEBUG    tuned.units.manager: creating 'cpu_offline' (cpu)
2023-09-28 09:24:56,876 DEBUG    tuned.plugins.base: assigning devices to instance cpu_online
2023-09-28 09:24:56,877 INFO     tuned.plugins.base: instance cpu: assigning devices cpu2, cpu0, cpu3, cpu1
2023-09-28 09:24:56,877 DEBUG    tuned.plugins.base: initializing instance cpu_online (cpu)
[...]
2023-09-28 09:24:56,881 DEBUG    tuned.plugins.base: assigning devices to instance cpu_offline
2023-09-28 09:24:56,881 WARNING  tuned.plugins.base: instance cpu_offline: no matching devices available
2023-09-28 09:24:56,881 DEBUG    tuned.plugins.base: initializing instance cpu_offline (cpu)
2023-09-28 09:24:56,881 INFO     tuned.plugins.plugin_cpu: Latency settings from non-first CPU plugin instance 'cpu_offline' will be ignored.

Changing the order (either in the file or with priorities) results in the correct device assignments, but the wrong "first instance":

2023-09-28 09:27:51,475 DEBUG    tuned.units.manager: creating 'cpu_offline' (cpu)
2023-09-28 09:27:51,475 DEBUG    tuned.units.manager: creating 'cpu_online' (cpu)
2023-09-28 09:27:51,476 DEBUG    tuned.plugins.base: assigning devices to instance cpu_offline
2023-09-28 09:27:51,477 INFO     tuned.plugins.base: instance cpu_offline (cpu): assigning devices cpu3
2023-09-28 09:27:51,477 DEBUG    tuned.plugins.base: initializing instance cpu_offline (cpu)
[...]
2023-09-28 09:27:51,480 DEBUG    tuned.plugins.base: assigning devices to instance cpu_online
2023-09-28 09:27:51,480 INFO     tuned.plugins.base: instance cpu_online (cpu): assigning devices cpu1, cpu0, cpu2
2023-09-28 09:27:51,481 DEBUG    tuned.plugins.base: initializing instance cpu_online (cpu)
2023-09-28 09:27:51,481 INFO     tuned.plugins.plugin_cpu: Latency settings from non-first CPU plugin instance 'cpu_online' will be ignored.

To make it work, a user also needs to explicitly specify the online CPUs:

[cpu_online]
type=cpu
devices=${f:cpulist2devs:0-2}
[cpu_offline]
type=cpu
devices=${f:cpulist2devs:3}
offline=true

Side-question: is this behavior of the device logic (an instance without an explicit devices= setting will claim all available devices, even devices that are explicitly requested by another/later instance) intended? I would expect any explicit devices= specification to take priority over the catch-all, regardless of instance priority.

adriaan42 avatar Sep 28 '23 07:09 adriaan42