tuned
tuned copied to clipboard
Tuning of individual interrupts
To optimize latencies in realtime applications, we have a need to direct certain interrupts to certain cpus running our realtime processes. In addition, the cpu allocation happens dynamically (e.g., applications deployed as containers).
There's two parts to this:
- Handle interrupts individually: current implementation of the
scheduler
plugin can only set affinities of ALL interrupts. - Set the affinities dynamically: current implementation requires the affinity to be set statically in the profile.
This PoC
- Adds a new plugin
irq
, which treats interrupts as "devices", so we can have multiple plugin instances, each with different affinities, and each tuning different irqs. - Adds functions to dynamically create and destroy plugin instances. This way, when a new application is started, we can create an instance of the irq plugin with the desired affinity, and then use the existing
instance_acquire_devices
to apply the selected affinity to individual interrupts.
Example, on a 4-core machine, with an irq
section in the profile:
[irq]
affinity=0
And then dynamically changing the affinity of irq16:
# busctl call com.redhat.tuned /Tuned com.redhat.tuned.control instance_create ssa{ss} irq myinstance 1 affinity 2
bs true "OK"
# busctl call com.redhat.tuned /Tuned com.redhat.tuned.control instance_acquire_devices ss irq16 myinstance
(bs) true "OK"
And in the log:
2023-12-08 15:16:05,638 INFO tuned.daemon.controller: Created dynamic instance 'myinstance' of plugin 'irq'
2023-12-08 15:16:29,263 INFO tuned.daemon.controller: Moving devices '{'irq16'}' from instance 'irq' to instance 'myinstance'.
2023-12-08 15:16:29,263 INFO tuned.plugins.plugin_irq: _instance_unapply_dynamic(irq, irq16)
2023-12-08 15:16:29,264 INFO tuned.plugins.plugin_irq: Setting SMP affinity of IRQ 16 to '0000000f'
2023-12-08 15:16:29,264 INFO tuned.plugins.hotplug: instance myinstance: adding new device irq16
2023-12-08 15:16:29,265 INFO tuned.plugins.plugin_irq: _instance_apply_dynamic(myinstance, irq16)
2023-12-08 15:16:29,265 INFO tuned.plugins.plugin_irq: Setting SMP affinity of IRQ 16 to '00000004'
Open questions
- What happens to the irq affinity feature of the
scheduler
plugin? For backwards-compatibility I guess we'll need to keep the feature and a switch to be able to tellscheduler
to not touch interrupt affinities in case we want to use the new plugin? - Can we move devices between plugin instances without running "unapply"? In my example above the interrupt affinity briefly gets restored to 0xf before the new desired value is applied...
- What happens when TuneD is restarted? Currently, any dynamic instances would be lost. But currently, also any device assignments that have been made using
instance_aquire_devices
are lost, right?
@yarda Any thoughts on this? Do you think a feature like this could be accepted?
/packit build
Hi @adriaan42 this is interesting RFE.
Open questions
* What happens to the irq affinity feature of the `scheduler` plugin? For backwards-compatibility I guess we'll need to keep the feature and a switch to be able to tell `scheduler` to not touch interrupt affinities in case we want to use the new plugin?
We are already trying to address this https://issues.redhat.com/browse/RHEL-21923
* Can we move devices between plugin instances without running "unapply"? In my example above the interrupt affinity briefly gets restored to 0xf before the new desired value is applied...
This is worth investigation and I was already thinking about it, because the "unapply" can be costly and can cause various interference. Generally, we need it, because one instance can change things which the other instance doesn't touch and wouldn't be then rolled back. Maybe we could handle it by more intelligent e.g. "switch" method that would combine "apply" and "unapply". I.e. it could compare the options and roll back only those which are not handled by the second instance. For those which are handled by both instances the second instance could just take the saved previous values from the first instance.
* What happens when TuneD is restarted? Currently, any dynamic instances would be lost. But currently, also any device assignments that have been made using `instance_aquire_devices` are lost, right?
Yes, I think it's OK. Usually it maybe useful to start-over with the clear stock profile. If it is required to preserve the state, I think we could add API call to e.g. dump the profile into /run/tuned and make the /run/tuned preference over /etc/tuned.
Please fix centos-7 CI failures. We are going to drop python 2.7 support once RHEL-7 is retired.
...
Compiling /builddir/build/BUILDROOT/tuned-2.21.0-1.20240118103851136139.pr580.33.ga7d8364.el7.x86_64/usr/lib/python2.7/site-packages/tuned/plugins/plugin_irq.py ...
File "/usr/lib/python2.7/site-packages/tuned/plugins/plugin_irq.py", line 33
log.info(f"_instance_init({instance.name})")
^
SyntaxError: invalid syntax
...
For better backward compatibility I would drop 7799dd2b4bafcbc23f677145ef5f9266ec5657df and rely on fix from the https://issues.redhat.com/browse/RHEL-21923 and would just change the TuneD profiles where needed.
Maybe the instance_create/destroy could be more generic, i.e.:
- instance_create could have a list of devices or regular expressions to match devices to immediately try to hotplug. Watchout for the inconsistency that empty list of devices is now handled as a glob "*". Maybe we could use some special keyword for it, like e.g. "EMPTY" or "none" which could be also used for static instance definition in the TuneD profile.
- instance_destroy could destroy all instances not only the dynamically created, i.e. statically created instances with no devices and in case there are some devices attached, it could try to release them and hotplug (like the hotplug code does).
Or maybe for simple PoC just allow destroying instances with no devices and do not distinguish between statically and dynamically created instances, i.e. if there are no devices it can be destroyed.
PR updates
- Changed scheduler plugin according to https://issues.redhat.com/browse/RHEL-21923
- Added tuning of the default irq affinity
- Removed (hopefully) all code that doesn't work on Python2.7
- other changes commented above
Some open points
-
Deleting instances: I have removed the
_created_dynamically
flag and respective checks, so also non-dynamically-created instances can be deleted. Should there be rules here? E.g., what happens if all instances of a plugin are deleted? - Behavior of dynamic instance creation/deletion, and using hotplug: I kept everything explicit and manual for now. At first glance, using hotplug could make this a little harder to use correctly, to avoid accidentally having and irq assigned to the wrong instance. E.g., if we say when an instance with assigned devices is deleted, we just automatically hotplug the devices to other instances -> What happens if there is more than one suitable instance?
- How do we want to handle (physical) hotplugging of devices, which leads to adding/removal of IRQs in the system? Do we need to periodically scan
/proc/irq
and adapt the devices known to the plugin? - Moving devices between plugin instances without running "unapply": That is certainly interesting, and I have some ideas, but I think that's for a future PR.
- Preserving state across daemon restarts: In general, I think a daemon restart (without changing the profile) should not have an impact on current tuning, so instances and attached devices should be persistent. But this is a bigger topic, and won't be addressed in this PR.
And one more thing
The goal of this is to be able to tune a device (usually a NIC) to be used exclusively and with low latency by a realtime application. The IRQ affinity is just one part of that. There is also:
- IRQ threads (kernel threads
irq/*
). The kernel automatically migrates them when the IRQ affinity is changed, so we don't need to touch their affinities manually, and in case we need to set their priority I believe we can do that with the scheduler plugin. - NAPI threads (kernel threads
napi/*
, created for a NIC, depending on the setting in/sys/class/net/%iface%/threaded
). Those are createdSCHED_OTHER
, and we can use the scheduler plugin to give them a suitable policy and priority. Their default affinity is "all cores", and currently we could use the scheduler plugin to set a static affinity for them. But we currently can't dynamically set their affinity. - Maybe other/future kernel threads that we'd also like to migrate dynamically.
This maybe goes beyond the scope if an "irq plugin", but logically it could fit here. But I haven't thought it through, especially the conflicts with the scheduler plugin. Maybe you have an opinion/idea?
So far I tested the new irq_process
option in the scheduler
plugin and it seems to work nice. When doing the final rebase/squash, you should probably add Resolves: RHEL-21923
into the commit message.
I have removed the _created_dynamically flag and respective checks, so also non-dynamically-created instances can be deleted. Should there be rules here? E.g., what happens if all instances of a plugin are deleted?
I can't think of a reason why deleting the last plugin instance should be disallowed. Technically, we could also permit creating instances of inactive plugins via the create_instance
command (though I'm not sure whether that has a real-life use case).
E.g., if we say when an instance with assigned devices is deleted, we just automatically hotplug the devices to other instances -> What happens if there is more than one suitable instance?
IIUC the hotplug implementation currently assigns new devices to suitable plugin instances based on their priority
option, as sorted here, so maybe that code could be re-used (and possibly overrided in the future if we want custom behavior in specific plugins). Although to be able to set the priority of newly created instances, the code would need to be refactored a bit.
Fedora rawhide CI failure is caused by regression in the kernel: https://bugzilla.redhat.com/show_bug.cgi?id=2262526
@adriaan42 please fix the CodeQL found problems.
IMHO: dee5d6f63aab46427ade7415babf3c5ef1488c52 - it's OK for merge now aad02c76e152bed9aea7a34846a013278db973f8 - tighten the policy (per inline comment) and it's probably OK 17b528c0d2065ff902d6ceb00407569c2a8f2c05 - this still needs some work, see comments
Some open points
- Deleting instances: I have removed the
_created_dynamically
flag and respective checks, so also non-dynamically-created instances can be deleted. Should there be rules here? E.g., what happens if all instances of a plugin are deleted?
As @zacikpa wrote I don't think we need any limitation here. Currently, at runtime I think you won't be able to achieve it, because through the API you can only move devices between instances, thus there will be always at least one instance with some devices which cannot be removed. If you startup with empty instances then you will be able to delete all instances, but I think it shouldn't cause any problem.
- How do we want to handle (physical) hotplugging of devices, which leads to adding/removal of IRQs in the system? Do we need to periodically scan
/proc/irq
and adapt the devices known to the plugin?
This is usually done by udev. It seems udev can parse the irqs:
udevadm info /sys/kernel/irq/59
P: /kernel/irq/59
M: 59
But I don't know whether it support hotplug for irqs. If yes we are probably done just by switching to udev enumeration. If not, we probably cannot support hotplug of irqs without periodic rescans and in such case I wouldn't cope with it for now.
- Moving devices between plugin instances without running "unapply": That is certainly interesting, and I have some ideas, but I think that's for a future PR.
OK, I agree.
- Preserving state across daemon restarts: In general, I think a daemon restart (without changing the profile) should not have an impact on current tuning, so instances and attached devices should be persistent. But this is a bigger topic, and won't be addressed in this PR.
I agree.
And one more thing
The goal of this is to be able to tune a device (usually a NIC) to be used exclusively and with low latency by a realtime application. The IRQ affinity is just one part of that. There is also:
IRQ threads (kernel threads
irq/*
). The kernel automatically migrates them when the IRQ affinity is changed, so we don't need to touch their affinities manually, and in case we need to set their priority I believe we can do that with the scheduler plugin.NAPI threads (kernel threads
napi/*
, created for a NIC, depending on the setting in/sys/class/net/%iface%/threaded
). Those are createdSCHED_OTHER
, and we can use the scheduler plugin to give them a suitable policy and priority. Their default affinity is "all cores", and currently we could use the scheduler plugin to set a static affinity for them. But we currently can't dynamically set their affinity.Maybe other/future kernel threads that we'd also like to migrate dynamically.
This maybe goes beyond the scope if an "irq plugin", but logically it could fit here. But I haven't thought it through, especially the conflicts with the scheduler plugin. Maybe you have an opinion/idea?
I think we can duplicate the functionality in plugins if necessary and cover the limitations in documentation. Otherwise I don't have more ideas at the moment.
IMHO: dee5d6f - it's OK for merge now aad02c7 - tighten the policy (per inline comment) and it's probably OK 17b528c - this still needs some work, see comments
Maybe could you split the PR? The first commit is ready, the second almost ready and we would like to have them in the upcoming release which will happen very soon. The last commit (irq plugin) may take some more time.
IMHO: dee5d6f - it's OK for merge now aad02c7 - tighten the policy (per inline comment) and it's probably OK 17b528c - this still needs some work, see comments
Maybe could you split the PR? The first commit is ready, the second almost ready and we would like to have them in the upcoming release which will happen very soon. The last commit (irq plugin) may take some more time.
Yes, the commits are pretty independent and I'm happy to split the PR. I agree that the first one is ready. For the second one, I'm not sure: this currently does not consider hotplugging, and with hotplugging we might need additional parameters for the instance_create
call, like a devices_udev_regex
and an instance priority
. It looks like the priority can be passed as part of the options
, but I'm not sure about the udev regex. That seems to be a separate argument to the instance creation.
I'll open independent PRs and we can discuss there...
IMHO: dee5d6f - it's OK for merge now aad02c7 - tighten the policy (per inline comment) and it's probably OK 17b528c - this still needs some work, see comments
Maybe could you split the PR? The first commit is ready, the second almost ready and we would like to have them in the upcoming release which will happen very soon. The last commit (irq plugin) may take some more time.
Yes, the commits are pretty independent and I'm happy to split the PR. I agree that the first one is ready. For the second one, I'm not sure: this currently does not consider hotplugging, and with hotplugging we might need additional parameters for the
instance_create
call, like adevices_udev_regex
and an instancepriority
. It looks like the priority can be passed as part of theoptions
, but I'm not sure about the udev regex. That seems to be a separate argument to the instance creation.I'll open independent PRs and we can discuss there...
Thanks. Regarding the second PR I thought it's good enough for the basic functionality and could be improved later. But if you want to do it right +1 from me.
Thanks. Regarding the second PR I thought it's good enough for the basic functionality and could be improved later. But if you want to do it right +1 from me.
I was worried that we might need to change the dbus interface later to support hotplugging. But I had another look, and I believe we could still pass all required argument in options
via dbus, and can then extract them just like the Unit
constructor (https://github.com/redhat-performance/tuned/blob/master/tuned/profiles/unit.py#L12).
So if it's ok for you, we can already use it and refine later. I opened #596, with the updated policies, and a NOTE
in the code about the current shortcomings.
Thanks. Regarding the second PR I thought it's good enough for the basic functionality and could be improved later. But if you want to do it right +1 from me.
I was worried that we might need to change the dbus interface later to support hotplugging. But I had another look, and I believe we could still pass all required argument in
options
via dbus, and can then extract them just like theUnit
constructor (https://github.com/redhat-performance/tuned/blob/master/tuned/profiles/unit.py#L12).So if it's ok for you, we can already use it and refine later. I opened #596, with the updated policies, and a
NOTE
in the code about the current shortcomings.
Thanks for the split. Currently, only the first commit is important for the upcoming release. The second commit is nice to have and I think it can wait, so I will keep it open.
- How do we want to handle (physical) hotplugging of devices, which leads to adding/removal of IRQs in the system? Do we need to periodically scan
/proc/irq
and adapt the devices known to the plugin?This is usually done by udev. It seems udev can parse the irqs:
udevadm info /sys/kernel/irq/59 P: /kernel/irq/59 M: 59
But I don't know whether it support hotplug for irqs. If yes we are probably done just by switching to udev enumeration. If not, we probably cannot support hotplug of irqs without periodic rescans and in such case I wouldn't cope with it for now.
Hmm, I am afraid this will not work with udev, I tried simple reproducer:
import pyudev
udev_context = pyudev.Context()
for d in udev_context.list_devices():
print(d)
Unfortunately, it seems it doesn't know the IRQs.
Another PR update:
- removed the "dynamic" part from this PR, as this is independent and has been moved to #596
- tried to addess most of the review comments
- code cleanup, documentation
I'll try to summarize the previous discussion and open topics:
- Determine the "first instance": The way as implemented in the
cpu
plugin does not work (basically the same issue I mentioned in https://github.com/redhat-performance/tuned/pull/366#issuecomment-1738633828):- I'd like to be able to have a configuration like this:
[irq_default] type=irq affinity=0 default_affinity=0 devices=* [irq_special] type=irq priority=-1 affinity=2 devices=irq42
- So there is a default instance that should handle global/static tuning and all devices not matching any other instance, and then dedicated instance for exceptions.
- The current implementation in the CPU plugin (which is the only other plugin with a "first instance") always takes the instance with the highest priority (lowest value) as "first instance", but by doing that, this instance is started first, and automatically claims all free devices.
- In my current proposal here, I determine the "first instance" by name (
instance._is_main_instance = instance.name == instance.plugin.name
), but I'm open to other suggestions, e.g. an explicit option (first_instance=True
)...
- I'd like to be able to have a configuration like this:
-
command_custom
vs.get/set
: I somehow like the custom variant a little better, because it's all in one function and the interface seems a little cleaner...
Sorry for the delay, but hopefully it's not urgent, although it's very nice RFE.
* Determine the "first instance": The way as implemented in the `cpu` plugin does not work (basically the same issue I mentioned in [cpu: add disable_thread_siblings option #366 (comment)](https://github.com/redhat-performance/tuned/pull/366#issuecomment-1738633828)): * I'd like to be able to have a configuration like this: ```ini [irq_default] type=irq affinity=0 default_affinity=0 devices=* [irq_special] type=irq priority=-1 affinity=2 devices=irq42 ``` * So there is a default instance that should handle global/static tuning and all devices not matching any other instance, and then dedicated instance for exceptions.
You can define the priority also with the order change, e.g.:
[irq_special]
type=irq
affinity=2
devices=irq42
[irq_default]
type=irq
affinity=0
default_affinity=0
devices=*
* The current implementation in the CPU plugin (which is the only other plugin with a "first instance") always takes the instance with the highest priority (lowest value) as "first instance", but by doing that, this instance is started first, and automatically claims all free devices. * In my current proposal here, I determine the "first instance" by name (`instance._is_main_instance = instance.name == instance.plugin.name`), but I'm open to other suggestions, e.g. an explicit option (`first_instance=True`)...
Do we need the first instance here? Although your approach with the first instance being the instance with the plugin name works, it's a new TuneD paradigm. What about using special device name, e.g. "DEFAULT", then:
[irq_default]
type=irq
affinity=0
devices=DEFAULT
I.e. handle the default SMP affinity as a special IRQ, then it could be matched by the glob '*', so you could write:
[irq_default]
type=irq
affinity=0
To cover all remaining IRQs including the default setting (implicit devices=*
). In case you don't want the default SMP, which I think you usually want:
[irq_nondefault]
type=irq
affinity=0
devices=!DEFAULT
Then I think you could also use the calc
parameter in the affinity
which could simplify things. What do you think?
Do we need the first instance here? Although your approach with the first instance being the instance with the plugin name works, it's a new TuneD paradigm. What about using special device name, e.g. "DEFAULT", then:
Good idea! Yes, that works and actually simplifies the code.
Then I think you could also use the
calc
parameter in theaffinity
which could simplify things. What do you think?
I'm not entirely sure how to handle the special values ignore
/calc
present in the scheduler plugin:
- We no longer need
ignore
. An empty affinity list means "don't touch". -
calc
means "derive affinity fromisolated_cores
". We don't have that variable in the new plugin, and the user now needs to calculate the desired set explicitly. I think that's ok, and there already are functions available to do this (e.g.cpulist_invert
). -
calc
also means "intersect the desired affinity with the current one". I assume that's so TuneD does not undo affinity settings made by someone else? Is that still needed? If so, then we could introduce a parametermode
which can beset
, to always apply the affinity value (default), orintersect
to calculate the new value based on the current one.
would then become[scheduler] isolated_cores=${isolcpus} default_irq_smp_affinity=calc
[irq] affinity=${f:cpulist_invert:${isolcpus}} devices=!default [irqdefault] affinity=${f:cpulist_invert:${isolcpus}} devices=default mode=intersect
* We no longer need `ignore`. An empty affinity list means "don't touch".
OK.
* `calc` means "derive affinity from `isolated_cores`". We don't have that variable in the new plugin, and the user now needs to calculate the desired set explicitly. I think that's ok, and there already are functions available to do this (e.g. `cpulist_invert`).
OK, it's always good to use what's currently available.
* `calc` also means "intersect the desired affinity with the current one". I assume that's so TuneD does not undo affinity settings made by someone else? Is that still needed? If so, then we could introduce a parameter `mode` which can be `set`, to always apply the affinity value (default), or `intersect` to calculate the new value based on the current one.
I think mode
is OK, I think it's the most flexible approach.
```ini [scheduler] isolated_cores=${isolcpus} default_irq_smp_affinity=calc ``` would then become ```ini [irq] affinity=${f:cpulist_invert:${isolcpus}} devices=!default [irqdefault] affinity=${f:cpulist_invert:${isolcpus}} devices=default mode=intersect ```
I think this can be even simplified to:
[irqdefault]
affinity=${f:cpulist_invert:${isolcpus}}
devices=default
mode=intersect
[irq]
affinity=${f:cpulist_invert:${isolcpus}}
Maybe I would use uppercase name i.e. DEFAULT
, as it is some special (virtual) device. IIRC some special values were previously defined in uppercase in TuneD and IMHO there is also lower probability of the name conflict with some existing device name. But if you like the lowercase variant more, I am not against.
Pushed updated implementation:
-
default_smp_affinity
is now handled by a virtual deviceDEFAULT
, eliminating the need for a "first/main instance". - instances have a
mode
option:set
always applies the configured affinity,intersect
calculates the new affinity as intersection of the current and the configured one, providing behavior of thecalc
setting of the scheduler plugin.
Thanks, LGTM.