scheduler plugin
I have several questions w.r.t. the scheduler plugin, which I examined in combination with the realtime profiles. I'm happy to contribute code if it turns out that things can be improved.
Matching of processes/threads (`group.*` options):
- Is it intentional that the matching happens against the process' `cmdline`, and not its name/`comm`? This can lead to situations where, e.g., `ls ksoftirqd` or `/usr/bin/echo "something with rcuc"` will be promoted to `SCHED_FIFO`, because the regex matches somewhere in the arguments.
- If it is intentional, then the regexes for kernel threads in the profiles should be adapted to include `^\[`, e.g. `group.ktimersoftd=0:f:3:*:^\[ktimersoftd.*`
- It can be useful to match only certain threads of a process, e.g. only the PMD threads of Open vSwitch, or only the vCPU threads of Qemu. I have a patch that introduces support for this by adding a second (optional) regex field. `group.ovs=1:f:42:*:ovs-vswitchd:^pmd-` could then be used to match and configure OVS's PMD threads.
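To illustrate the first two points, here is a minimal Python sketch. It assumes that the regex is applied with search semantics (a match anywhere in the string counts) and that kernel threads are represented with their name in square brackets, which is what the `^\[` suggestion implies. The sample cmdline strings are made up.

```python
import re

# Regexes as they might appear in the last field of a group.* option.
UNANCHORED = re.compile(r"ktimersoftd.*")
ANCHORED = re.compile(r"^\[ktimersoftd.*")

# Made-up cmdline strings; the bracketed form for kernel threads is an assumption.
CMDLINES = [
    "[ktimersoftd/0]",              # kernel thread
    "ls ktimersoftd",               # unrelated userland command
    '/usr/bin/echo "ktimersoftd"',  # unrelated userland command
]


def matching(pattern, cmdlines):
    """Return the cmdlines in which the pattern matches anywhere."""
    return [c for c in cmdlines if pattern.search(c) is not None]


unanchored_hits = matching(UNANCHORED, CMDLINES)
anchored_hits = matching(ANCHORED, CMDLINES)
print(unanchored_hits)
print(anchored_hits)
```

With these inputs the unanchored regex selects all three entries, while the anchored one selects only the kernel thread.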
Non-deterministic system state:
The result of some features depends on the startup order, i.e. if an application is started before or after TuneD, or if TuneD is restarted at some point:
- Changes of process affinity only happen when TuneD starts, and don't affect processes that are launched later.
- Changes of policy/priority also happen for new processes/threads, but the result can still be different, depending on startup order, and also influenced by `perf_process_fork`.
Affinity changes, ps_blacklist/ps_whitelist:
- Why are the affinities of userland processes changed at all? Can't we assume that processes running on isolated cores (`isolcpus`) have been put there deliberately?
I have several questions w.r.t. the scheduler plugin, which I examined in combination with the realtime profiles. I'm happy to contribute code if it turns out that things can be improved.
Things can definitely be improved there. The scheduler plugin has bloated over time and it doesn't scale well. A rewrite/refactor, and maybe a split of the features, is on our todo list, but it is not a priority at the moment. Quite a lot of people are using it and we don't want to break anybody.
Matching of processes/threads (`group.*` options):
* Is it intentional that the matching happens against the process' `cmdline`, and not its name/`comm`? This can lead to situations where, e.g., `ls ksoftirqd` or `/usr/bin/echo "something with rcuc"` will be promoted to `SCHED_FIFO`, because the regex matches somewhere in the arguments.
It's intentional. IIRC the PoC version of the code originally matched the COMM, but this was changed because matching the cmdline:
- is more flexible, because it can match the args (I agree that the regexes in the upstream profiles may be worth improving so that they don't accidentally match the args :)
- was required to match processes that changed their COMM via the prctl call, so that they could be matched by their argv name
- gave an easy and unambiguous way to match the kernel threads
Unfortunately, it brings some performance problems. The COMM can be read directly from perf, but the cmdline cannot. The cmdline requires a lookup in the /proc FS, which is quite a performance bottleneck. Moreover, the COMM can be used for matching custom thread names.
So we will probably add support for the COMM and allow the user to choose (through some option).
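For illustration, a rough Python sketch of the two sources on Linux: `/proc/<pid>/comm` holds the short name, while `/proc/<pid>/cmdline` holds the NUL-separated argv and is empty for kernel threads. The helper names and the `proc_root` parameter are hypothetical and not taken from the TuneD code.

```python
from pathlib import Path


def parse_cmdline(raw: bytes) -> str:
    """Turn the NUL-separated argv from /proc/<pid>/cmdline into one string."""
    parts = raw.rstrip(b"\0").split(b"\0")
    return " ".join(p.decode(errors="replace") for p in parts)


def read_comm(pid: int, proc_root: str = "/proc") -> str:
    """Read the short task name (COMM); the kernel appends a newline."""
    return (Path(proc_root) / str(pid) / "comm").read_text().rstrip("\n")


def read_cmdline(pid: int, proc_root: str = "/proc") -> str:
    """Read the full command line; this is the extra /proc lookup per process."""
    return parse_cmdline((Path(proc_root) / str(pid) / "cmdline").read_bytes())
```

On a Linux system, `read_comm(os.getpid())` and `read_cmdline(os.getpid())` (after `import os`) would show the difference for the running interpreter.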
* If it is intentional, then the regexes for kernel threads in the profiles should be adapted to include `^\[`, e.g. `group.ktimersoftd=0:f:3:*:^\[ktimersoftd.*`
Yes, otherwise with one regex it could be hard to distinguish between processes and kernel threads.
* It can be useful to match only certain threads of a process, e.g. only the PMD threads of Open vSwitch, or only the vCPU threads of Qemu. I have a patch that introduces support for this by adding a second (optional) regex field. `group.ovs=1:f:42:*:ovs-vswitchd:^pmd-` could then be used to match and configure OVS's PMD threads.
I agree. Either another regex, or an option that selects what the current regex is matched against.
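A minimal sketch of how the proposed optional second regex field could be parsed and applied. This is not the patch mentioned above and not TuneD's parser: the field names, `parse_group` and `rule_applies` are hypothetical, and the sketch assumes the thread regex is matched against the thread's name.

```python
import re
from typing import NamedTuple, Optional


class GroupRule(NamedTuple):
    rule_prio: int
    sched: str
    prio: str
    affinity: str
    process_regex: re.Pattern
    thread_regex: Optional[re.Pattern]  # None when the sixth field is absent


def parse_group(value: str) -> GroupRule:
    """Parse 'rule_prio:sched:prio:affinity:regex[:thread_regex]'."""
    # Caveat of this sketch: a process regex that itself contains ':' would be
    # split here, so a real implementation would need to handle that case.
    fields = value.split(":", 5)
    if len(fields) < 5:
        raise ValueError("expected at least 5 colon-separated fields: %r" % value)
    thread_re = re.compile(fields[5]) if len(fields) == 6 else None
    return GroupRule(int(fields[0]), fields[1], fields[2], fields[3],
                     re.compile(fields[4]), thread_re)


def rule_applies(rule: GroupRule, cmdline: str, thread_name: str) -> bool:
    """The process regex must match; the thread regex only if it was given."""
    if rule.process_regex.search(cmdline) is None:
        return False
    return rule.thread_regex is None or rule.thread_regex.search(thread_name) is not None


ovs = parse_group("1:f:42:*:ovs-vswitchd:^pmd-")
print(rule_applies(ovs, "ovs-vswitchd --pidfile", "pmd-c01"))   # PMD thread
print(rule_applies(ovs, "ovs-vswitchd --pidfile", "handler1"))  # other thread
```

Existing five-field rules keep working unchanged in this scheme, because the thread regex is simply absent for them.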
Non-deterministic system state:
The result of some features depends on the startup order, i.e. if an application is started before or after TuneD, or if TuneD is restarted at some point:
* Changes of process affinity only happen when TuneD starts, and don't affect processes that are launched later.
This is an incomplete feature. It was intended to fully process new processes, but this was never completed: at first due to the tight schedule, and later because it could add more performance overhead and the function was duplicated by isolcpus. I think it should fully process the processes, but it may need some option (like e.g. the perf_process_fork we added) or better optimization.
* Changes of policy/priority also happen for new processes/threads, but the result can still be different, depending on startup order, and also influenced by `perf_process_fork`.
Affinity changes, ps_blacklist/ps_whitelist:
* Why are the affinities of userland processes changed at all? Can't we assume that processes running on isolated cores (`isolcpus`) have been put there deliberately?
It's because the plugin can be used without isolcpus, e.g. with the systemd affinities or even without it. With isolcpus this functionality is probably not needed, although IIRC there were some problems with some threads that had to be moved manually (maybe these problems have already been fixed in the kernel).
@yarda Thanks for the clarifications!
I guess fully monitoring all relevant events w.r.t. affinities and scheduling policies/priorities is difficult, and would probably mean intercepting all related syscalls using perf tracepoints. In a controlled system with "well-behaved" applications, it is probably a better strategy to set those parameters manually, and maybe monitor by periodically calling verify.
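As a rough idea of what such a periodic check could look like when done by hand, here is a small Python sketch using the standard `os.sched_*` calls (Linux only). It is a stand-in for illustration and not TuneD's own verify; `SchedState`, `current_state` and `drift` are hypothetical names.

```python
import os
from typing import List, NamedTuple


class SchedState(NamedTuple):
    policy: int          # e.g. os.SCHED_FIFO
    priority: int        # realtime priority, 0 for non-realtime policies
    affinity: frozenset  # CPUs the task may run on


def current_state(pid: int) -> SchedState:
    """Query the kernel for the task's current settings (0 means the caller)."""
    return SchedState(
        os.sched_getscheduler(pid),
        os.sched_getparam(pid).sched_priority,
        frozenset(os.sched_getaffinity(pid)),
    )


def drift(expected: SchedState, actual: SchedState) -> List[str]:
    """Return the names of the fields that no longer match the expectation."""
    return [f for f in SchedState._fields if getattr(expected, f) != getattr(actual, f)]
```

A monitoring loop would call `drift(expected, current_state(pid))` at some interval and report or re-apply the settings whenever the returned list is not empty.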
I'm probably the wrong person to attempt larger refactorings, but I do have some suggestions for incremental improvements of the current implementation, and will post some PRs.