tuned Tuning of individual interrupts

To optimize latencies in realtime applications, we have a need to direct certain interrupts to certain cpus running our realtime processes. In addition, the cpu allocation happens dynamically (e.g., applications deployed as containers).

There's two parts to this:

Handle interrupts individually: current implementation of the scheduler plugin can only set affinities of ALL interrupts.
Set the affinities dynamically: current implementation requires the affinity to be set statically in the profile.

This PoC

Adds a new plugin irq, which treats interrupts as "devices", so we can have multiple plugin instances, each with different affinities, and each tuning different irqs.
Adds functions to dynamically create and destroy plugin instances. This way, when a new application is started, we can create an instance of the irq plugin with the desired affinity, and then use the existing instance_acquire_devices to apply the selected affinity to individual interrupts.

Example, on a 4-core machine, with an irq section in the profile:

[irq]
affinity=0

And then dynamically changing the affinity of irq16:

# busctl call com.redhat.tuned /Tuned com.redhat.tuned.control instance_create ssa{ss} irq myinstance 1 affinity 2
bs true "OK"
# busctl call com.redhat.tuned /Tuned com.redhat.tuned.control instance_acquire_devices ss irq16 myinstance
(bs) true "OK"

And in the log:

2023-12-08 15:16:05,638 INFO     tuned.daemon.controller: Created dynamic instance 'myinstance' of plugin 'irq'
2023-12-08 15:16:29,263 INFO     tuned.daemon.controller: Moving devices '{'irq16'}' from instance 'irq' to instance 'myinstance'.
2023-12-08 15:16:29,263 INFO     tuned.plugins.plugin_irq: _instance_unapply_dynamic(irq, irq16)
2023-12-08 15:16:29,264 INFO     tuned.plugins.plugin_irq: Setting SMP affinity of IRQ 16 to '0000000f'
2023-12-08 15:16:29,264 INFO     tuned.plugins.hotplug: instance myinstance: adding new device irq16
2023-12-08 15:16:29,265 INFO     tuned.plugins.plugin_irq: _instance_apply_dynamic(myinstance, irq16)
2023-12-08 15:16:29,265 INFO     tuned.plugins.plugin_irq: Setting SMP affinity of IRQ 16 to '00000004'

Open questions

What happens to the irq affinity feature of the scheduler plugin? For backwards-compatibility I guess we'll need to keep the feature and a switch to be able to tell scheduler to not touch interrupt affinities in case we want to use the new plugin?
Can we move devices between plugin instances without running "unapply"? In my example above the interrupt affinity briefly gets restored to 0xf before the new desired value is applied...
What happens when TuneD is restarted? Currently, any dynamic instances would be lost. But currently, also any device assignments that have been made using instance_aquire_devices are lost, right?

Dec 08 '23 14:12 adriaan42

@yarda Any thoughts on this? Do you think a feature like this could be accepted?

Jan 02 '24 13:01 adriaan42

/packit build

Jan 18 '24 10:01 yarda

Hi @adriaan42 this is interesting RFE.

Open questions

* What happens to the irq affinity feature of the `scheduler` plugin? For backwards-compatibility I guess we'll need to keep the feature and a switch to be able to tell `scheduler` to not touch interrupt affinities in case we want to use the new plugin?

We are already trying to address this https://issues.redhat.com/browse/RHEL-21923

* Can we move devices between plugin instances without running "unapply"? In my example above the interrupt affinity briefly gets restored to 0xf before the new desired value is applied...

This is worth investigation and I was already thinking about it, because the "unapply" can be costly and can cause various interference. Generally, we need it, because one instance can change things which the other instance doesn't touch and wouldn't be then rolled back. Maybe we could handle it by more intelligent e.g. "switch" method that would combine "apply" and "unapply". I.e. it could compare the options and roll back only those which are not handled by the second instance. For those which are handled by both instances the second instance could just take the saved previous values from the first instance.

* What happens when TuneD is restarted? Currently, any dynamic instances would be lost. But currently, also any device assignments that have been made using `instance_aquire_devices` are lost, right?

Yes, I think it's OK. Usually it maybe useful to start-over with the clear stock profile. If it is required to preserve the state, I think we could add API call to e.g. dump the profile into /run/tuned and make the /run/tuned preference over /etc/tuned.

Jan 18 '24 11:01 yarda

Please fix centos-7 CI failures. We are going to drop python 2.7 support once RHEL-7 is retired.

...
Compiling /builddir/build/BUILDROOT/tuned-2.21.0-1.20240118103851136139.pr580.33.ga7d8364.el7.x86_64/usr/lib/python2.7/site-packages/tuned/plugins/plugin_irq.py ...
  File "/usr/lib/python2.7/site-packages/tuned/plugins/plugin_irq.py", line 33
    log.info(f"_instance_init({instance.name})")
                                              ^
SyntaxError: invalid syntax
...

Jan 18 '24 12:01 yarda

For better backward compatibility I would drop 7799dd2b4bafcbc23f677145ef5f9266ec5657df and rely on fix from the https://issues.redhat.com/browse/RHEL-21923 and would just change the TuneD profiles where needed.

Jan 18 '24 12:01 yarda

Maybe the instance_create/destroy could be more generic, i.e.:

instance_create could have a list of devices or regular expressions to match devices to immediately try to hotplug. Watchout for the inconsistency that empty list of devices is now handled as a glob "*". Maybe we could use some special keyword for it, like e.g. "EMPTY" or "none" which could be also used for static instance definition in the TuneD profile.
instance_destroy could destroy all instances not only the dynamically created, i.e. statically created instances with no devices and in case there are some devices attached, it could try to release them and hotplug (like the hotplug code does).

Or maybe for simple PoC just allow destroying instances with no devices and do not distinguish between statically and dynamically created instances, i.e. if there are no devices it can be destroyed.

Jan 18 '24 13:01 yarda

PR updates

Changed scheduler plugin according to https://issues.redhat.com/browse/RHEL-21923
Added tuning of the default irq affinity
Removed (hopefully) all code that doesn't work on Python2.7
other changes commented above

Some open points

Deleting instances: I have removed the _created_dynamically flag and respective checks, so also non-dynamically-created instances can be deleted. Should there be rules here? E.g., what happens if all instances of a plugin are deleted?
Behavior of dynamic instance creation/deletion, and using hotplug: I kept everything explicit and manual for now. At first glance, using hotplug could make this a little harder to use correctly, to avoid accidentally having and irq assigned to the wrong instance. E.g., if we say when an instance with assigned devices is deleted, we just automatically hotplug the devices to other instances -> What happens if there is more than one suitable instance?
How do we want to handle (physical) hotplugging of devices, which leads to adding/removal of IRQs in the system? Do we need to periodically scan /proc/irq and adapt the devices known to the plugin?
Moving devices between plugin instances without running "unapply": That is certainly interesting, and I have some ideas, but I think that's for a future PR.
Preserving state across daemon restarts: In general, I think a daemon restart (without changing the profile) should not have an impact on current tuning, so instances and attached devices should be persistent. But this is a bigger topic, and won't be addressed in this PR.

And one more thing

The goal of this is to be able to tune a device (usually a NIC) to be used exclusively and with low latency by a realtime application. The IRQ affinity is just one part of that. There is also:

IRQ threads (kernel threads irq/*). The kernel automatically migrates them when the IRQ affinity is changed, so we don't need to touch their affinities manually, and in case we need to set their priority I believe we can do that with the scheduler plugin.
NAPI threads (kernel threads napi/*, created for a NIC, depending on the setting in /sys/class/net/%iface%/threaded). Those are created SCHED_OTHER, and we can use the scheduler plugin to give them a suitable policy and priority. Their default affinity is "all cores", and currently we could use the scheduler plugin to set a static affinity for them. But we currently can't dynamically set their affinity.
Maybe other/future kernel threads that we'd also like to migrate dynamically.

This maybe goes beyond the scope if an "irq plugin", but logically it could fit here. But I haven't thought it through, especially the conflicts with the scheduler plugin. Maybe you have an opinion/idea?

Jan 24 '24 08:01 adriaan42

So far I tested the new irq_process option in the scheduler plugin and it seems to work nice. When doing the final rebase/squash, you should probably add Resolves: RHEL-21923 into the commit message.

I have removed the _created_dynamically flag and respective checks, so also non-dynamically-created instances can be deleted. Should there be rules here? E.g., what happens if all instances of a plugin are deleted?

I can't think of a reason why deleting the last plugin instance should be disallowed. Technically, we could also permit creating instances of inactive plugins via the create_instance command (though I'm not sure whether that has a real-life use case).

E.g., if we say when an instance with assigned devices is deleted, we just automatically hotplug the devices to other instances -> What happens if there is more than one suitable instance?

IIUC the hotplug implementation currently assigns new devices to suitable plugin instances based on their priority option, as sorted here, so maybe that code could be re-used (and possibly overrided in the future if we want custom behavior in specific plugins). Although to be able to set the priority of newly created instances, the code would need to be refactored a bit.

Jan 30 '24 11:01 zacikpa

Fedora rawhide CI failure is caused by regression in the kernel: https://bugzilla.redhat.com/show_bug.cgi?id=2262526

Feb 08 '24 10:02 yarda

@adriaan42 please fix the CodeQL found problems.

Feb 08 '24 11:02 yarda

IMHO: dee5d6f63aab46427ade7415babf3c5ef1488c52 - it's OK for merge now aad02c76e152bed9aea7a34846a013278db973f8 - tighten the policy (per inline comment) and it's probably OK 17b528c0d2065ff902d6ceb00407569c2a8f2c05 - this still needs some work, see comments

Feb 08 '24 12:02 yarda

Some open points

Deleting instances: I have removed the _created_dynamically flag and respective checks, so also non-dynamically-created instances can be deleted. Should there be rules here? E.g., what happens if all instances of a plugin are deleted?

As @zacikpa wrote I don't think we need any limitation here. Currently, at runtime I think you won't be able to achieve it, because through the API you can only move devices between instances, thus there will be always at least one instance with some devices which cannot be removed. If you startup with empty instances then you will be able to delete all instances, but I think it shouldn't cause any problem.

How do we want to handle (physical) hotplugging of devices, which leads to adding/removal of IRQs in the system? Do we need to periodically scan /proc/irq and adapt the devices known to the plugin?

This is usually done by udev. It seems udev can parse the irqs:

udevadm info /sys/kernel/irq/59
P: /kernel/irq/59
M: 59

But I don't know whether it support hotplug for irqs. If yes we are probably done just by switching to udev enumeration. If not, we probably cannot support hotplug of irqs without periodic rescans and in such case I wouldn't cope with it for now.

Moving devices between plugin instances without running "unapply": That is certainly interesting, and I have some ideas, but I think that's for a future PR.

OK, I agree.

Preserving state across daemon restarts: In general, I think a daemon restart (without changing the profile) should not have an impact on current tuning, so instances and attached devices should be persistent. But this is a bigger topic, and won't be addressed in this PR.

I agree.

And one more thing

The goal of this is to be able to tune a device (usually a NIC) to be used exclusively and with low latency by a realtime application. The IRQ affinity is just one part of that. There is also:

IRQ threads (kernel threads irq/*). The kernel automatically migrates them when the IRQ affinity is changed, so we don't need to touch their affinities manually, and in case we need to set their priority I believe we can do that with the scheduler plugin.

NAPI threads (kernel threads napi/*, created for a NIC, depending on the setting in /sys/class/net/%iface%/threaded). Those are created SCHED_OTHER, and we can use the scheduler plugin to give them a suitable policy and priority. Their default affinity is "all cores", and currently we could use the scheduler plugin to set a static affinity for them. But we currently can't dynamically set their affinity.

Maybe other/future kernel threads that we'd also like to migrate dynamically.

This maybe goes beyond the scope if an "irq plugin", but logically it could fit here. But I haven't thought it through, especially the conflicts with the scheduler plugin. Maybe you have an opinion/idea?

I think we can duplicate the functionality in plugins if necessary and cover the limitations in documentation. Otherwise I don't have more ideas at the moment.

Feb 08 '24 14:02 yarda

IMHO: dee5d6f - it's OK for merge now aad02c7 - tighten the policy (per inline comment) and it's probably OK 17b528c - this still needs some work, see comments

Maybe could you split the PR? The first commit is ready, the second almost ready and we would like to have them in the upcoming release which will happen very soon. The last commit (irq plugin) may take some more time.

Feb 08 '24 14:02 yarda

IMHO: dee5d6f - it's OK for merge now aad02c7 - tighten the policy (per inline comment) and it's probably OK 17b528c - this still needs some work, see comments

Maybe could you split the PR? The first commit is ready, the second almost ready and we would like to have them in the upcoming release which will happen very soon. The last commit (irq plugin) may take some more time.

Yes, the commits are pretty independent and I'm happy to split the PR. I agree that the first one is ready. For the second one, I'm not sure: this currently does not consider hotplugging, and with hotplugging we might need additional parameters for the instance_create call, like a devices_udev_regex and an instance priority. It looks like the priority can be passed as part of the options, but I'm not sure about the udev regex. That seems to be a separate argument to the instance creation.

I'll open independent PRs and we can discuss there...

Feb 08 '24 14:02 adriaan42

IMHO: dee5d6f - it's OK for merge now aad02c7 - tighten the policy (per inline comment) and it's probably OK 17b528c - this still needs some work, see comments

Maybe could you split the PR? The first commit is ready, the second almost ready and we would like to have them in the upcoming release which will happen very soon. The last commit (irq plugin) may take some more time.

Yes, the commits are pretty independent and I'm happy to split the PR. I agree that the first one is ready. For the second one, I'm not sure: this currently does not consider hotplugging, and with hotplugging we might need additional parameters for the instance_create call, like a devices_udev_regex and an instance priority. It looks like the priority can be passed as part of the options, but I'm not sure about the udev regex. That seems to be a separate argument to the instance creation.

I'll open independent PRs and we can discuss there...

Thanks. Regarding the second PR I thought it's good enough for the basic functionality and could be improved later. But if you want to do it right +1 from me.

Feb 08 '24 14:02 yarda

Thanks. Regarding the second PR I thought it's good enough for the basic functionality and could be improved later. But if you want to do it right +1 from me.

I was worried that we might need to change the dbus interface later to support hotplugging. But I had another look, and I believe we could still pass all required argument in options via dbus, and can then extract them just like the Unit constructor (https://github.com/redhat-performance/tuned/blob/master/tuned/profiles/unit.py#L12).

So if it's ok for you, we can already use it and refine later. I opened #596, with the updated policies, and a NOTE in the code about the current shortcomings.

Feb 08 '24 15:02 adriaan42

Thanks. Regarding the second PR I thought it's good enough for the basic functionality and could be improved later. But if you want to do it right +1 from me.

I was worried that we might need to change the dbus interface later to support hotplugging. But I had another look, and I believe we could still pass all required argument in options via dbus, and can then extract them just like the Unit constructor (https://github.com/redhat-performance/tuned/blob/master/tuned/profiles/unit.py#L12).

So if it's ok for you, we can already use it and refine later. I opened #596, with the updated policies, and a NOTE in the code about the current shortcomings.

Thanks for the split. Currently, only the first commit is important for the upcoming release. The second commit is nice to have and I think it can wait, so I will keep it open.

Feb 08 '24 15:02 yarda

How do we want to handle (physical) hotplugging of devices, which leads to adding/removal of IRQs in the system? Do we need to periodically scan /proc/irq and adapt the devices known to the plugin?

This is usually done by udev. It seems udev can parse the irqs:
udevadm info /sys/kernel/irq/59
P: /kernel/irq/59
M: 59
But I don't know whether it support hotplug for irqs. If yes we are probably done just by switching to udev enumeration. If not, we probably cannot support hotplug of irqs without periodic rescans and in such case I wouldn't cope with it for now.

Hmm, I am afraid this will not work with udev, I tried simple reproducer:

import pyudev

udev_context = pyudev.Context()
for d in udev_context.list_devices():
  print(d)

Unfortunately, it seems it doesn't know the IRQs.

Feb 08 '24 16:02 yarda

Another PR update:

removed the "dynamic" part from this PR, as this is independent and has been moved to #596
tried to addess most of the review comments
code cleanup, documentation

I'll try to summarize the previous discussion and open topics:

Determine the "first instance": The way as implemented in the cpu plugin does not work (basically the same issue I mentioned in https://github.com/redhat-performance/tuned/pull/366#issuecomment-1738633828):
- I'd like to be able to have a configuration like this:
```
[irq_default]
type=irq
affinity=0
default_affinity=0
devices=*
[irq_special]
type=irq
priority=-1
affinity=2
devices=irq42
```
- So there is a default instance that should handle global/static tuning and all devices not matching any other instance, and then dedicated instance for exceptions.
- The current implementation in the CPU plugin (which is the only other plugin with a "first instance") always takes the instance with the highest priority (lowest value) as "first instance", but by doing that, this instance is started first, and automatically claims all free devices.
- In my current proposal here, I determine the "first instance" by name (instance._is_main_instance = instance.name == instance.plugin.name), but I'm open to other suggestions, e.g. an explicit option (first_instance=True)...
command_custom vs. get/set: I somehow like the custom variant a little better, because it's all in one function and the interface seems a little cleaner...

Mar 11 '24 16:03 adriaan42

Sorry for the delay, but hopefully it's not urgent, although it's very nice RFE.

* Determine the "first instance": The way as implemented in the `cpu` plugin does not work (basically the same issue I mentioned in [cpu: add disable_thread_siblings option #366 (comment)](https://github.com/redhat-performance/tuned/pull/366#issuecomment-1738633828)):
  
  * I'd like to be able to have a configuration like this:
    ```ini
    [irq_default]
    type=irq
    affinity=0
    default_affinity=0
    devices=*
    [irq_special]
    type=irq
    priority=-1
    affinity=2
    devices=irq42
    ```
  * So there is a default instance that should handle global/static tuning and all devices not matching any other instance, and then dedicated instance for exceptions.

You can define the priority also with the order change, e.g.:

[irq_special]
type=irq
affinity=2
devices=irq42

[irq_default]
type=irq
affinity=0
default_affinity=0
devices=*

  * The current implementation in the CPU plugin (which is the only other plugin with a "first instance") always takes the instance with the highest priority (lowest value) as "first instance", but by doing that, this instance is started first, and automatically claims all free devices.
  * In my current proposal here, I determine the "first instance" by name (`instance._is_main_instance = instance.name == instance.plugin.name`), but I'm open to other suggestions, e.g. an explicit option (`first_instance=True`)...

Do we need the first instance here? Although your approach with the first instance being the instance with the plugin name works, it's a new TuneD paradigm. What about using special device name, e.g. "DEFAULT", then:

[irq_default]
type=irq
affinity=0
devices=DEFAULT

I.e. handle the default SMP affinity as a special IRQ, then it could be matched by the glob '*', so you could write:

[irq_default]
type=irq
affinity=0

To cover all remaining IRQs including the default setting (implicit devices=*). In case you don't want the default SMP, which I think you usually want:

[irq_nondefault]
type=irq
affinity=0
devices=!DEFAULT

Then I think you could also use the calc parameter in the affinity which could simplify things. What do you think?

May 16 '24 14:05 yarda

Do we need the first instance here? Although your approach with the first instance being the instance with the plugin name works, it's a new TuneD paradigm. What about using special device name, e.g. "DEFAULT", then:

Good idea! Yes, that works and actually simplifies the code.

Then I think you could also use the calc parameter in the affinity which could simplify things. What do you think?

I'm not entirely sure how to handle the special values ignore/calc present in the scheduler plugin:

We no longer need ignore. An empty affinity list means "don't touch".
calc means "derive affinity from isolated_cores". We don't have that variable in the new plugin, and the user now needs to calculate the desired set explicitly. I think that's ok, and there already are functions available to do this (e.g. cpulist_invert).
calc also means "intersect the desired affinity with the current one". I assume that's so TuneD does not undo affinity settings made by someone else? Is that still needed? If so, then we could introduce a parameter mode which can be set, to always apply the affinity value (default), or intersect to calculate the new value based on the current one.
```
[scheduler]
isolated_cores=${isolcpus}
default_irq_smp_affinity=calc
```
would then become
```
[irq]
affinity=${f:cpulist_invert:${isolcpus}}
devices=!default
[irqdefault]
affinity=${f:cpulist_invert:${isolcpus}}
devices=default
mode=intersect
```

May 17 '24 05:05 adriaan42

* We no longer need `ignore`. An empty affinity list means "don't touch".

OK.

* `calc` means "derive affinity from `isolated_cores`". We don't have that variable in the new plugin, and the user now needs to calculate the desired set explicitly. I think that's ok, and there already are functions available to do this (e.g. `cpulist_invert`).

OK, it's always good to use what's currently available.

* `calc` also means "intersect the desired affinity with the current one". I assume that's so TuneD does not undo affinity settings made by someone else? Is that still needed? If so, then we could introduce a parameter `mode` which can be `set`, to always apply the affinity value (default), or `intersect` to calculate the new value based on the current one.

I think mode is OK, I think it's the most flexible approach.

  ```ini
  [scheduler]
  isolated_cores=${isolcpus}
  default_irq_smp_affinity=calc
  ```
  would then become
  ```ini
  [irq]
  affinity=${f:cpulist_invert:${isolcpus}}
  devices=!default
  [irqdefault]
  affinity=${f:cpulist_invert:${isolcpus}}
  devices=default
  mode=intersect
  ```

I think this can be even simplified to:

[irqdefault]
affinity=${f:cpulist_invert:${isolcpus}}
devices=default
mode=intersect
[irq]
affinity=${f:cpulist_invert:${isolcpus}}

Maybe I would use uppercase name i.e. DEFAULT, as it is some special (virtual) device. IIRC some special values were previously defined in uppercase in TuneD and IMHO there is also lower probability of the name conflict with some existing device name. But if you like the lowercase variant more, I am not against.

May 17 '24 08:05 yarda

Pushed updated implementation:

default_smp_affinity is now handled by a virtual device DEFAULT, eliminating the need for a "first/main instance".
instances have a mode option: set always applies the configured affinity, intersect calculates the new affinity as intersection of the current and the configured one, providing behavior of the calc setting of the scheduler plugin.

May 17 '24 11:05 adriaan42

Thanks, LGTM.

May 21 '24 12:05 yarda

tuned tuned copied to clipboard

Tuning of individual interrupts

PR updates

Some open points

And one more thing

Some open points

And one more thing

tuned
tuned copied to clipboard