Support changing CPU priorities for backends and shard moves
Intro
This adds support to Citus for changing the CPU priority values of backends. It was created with two main use cases in mind:
- Users might want to run the logical replication part of shard moves or shard splits at a higher speed than it would get by default. This might cause some small loss of DB performance for their regular queries, but this is often worth it. During high load it's very possible that the logical replication WAL sender is not able to keep up with the WAL that is generated. This is an especially big problem when the machine is close to running out of disk space while doing a rebalance.
- Users might have certain long-running queries that they want to run at a lower priority, so that they don't impact their regular workload too much.
Be very careful!!!
Using CPU priorities to control scheduling can be helpful in some cases to control which processes get more CPU time than others. However, due to an issue called "priority inversion" it's possible that using CPU priorities together with the many locks that are used within Postgres causes the exact opposite behavior of what you intended. This is why this PR only allows the PG superuser to change the CPU priority of its own processes. It's currently not recommended to set citus.cpu_priority directly. The only recommended interface for users is the setting called citus.cpu_priority_for_logical_replication_senders. This setting controls the CPU priority of a very limited set of processes (the logical replication senders), so the dangers of priority inversion are also limited when using it for this use case.
Background
Before reading the rest it's important to understand some basic background regarding process CPU priorities, because they are a bit counterintuitive. A lower priority value means that the process will be scheduled more, and whatever it's doing will thus complete faster. The default priority for processes is 0. Valid values range from -20 to 19 inclusive. On Linux, a larger difference between the values of two processes results in a bigger difference in the percentage of scheduling each receives.
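To make this concrete, here is a minimal sketch using the citus.cpu_priority setting added by this PR (run as superuser, since only the superuser may change it; lower values mean more CPU time):
SET citus.cpu_priority = -10;  -- this backend is now scheduled more often
SET citus.cpu_priority = 19;   -- this backend is now scheduled least often
RESET citus.cpu_priority;      -- back to the default of 0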
Handling the use cases
Use case 1 can be achieved by setting citus.cpu_priority_for_logical_replication_senders to the priority value that you want it to have. It's necessary to set this on both the workers and the coordinator. Example:
citus.cpu_priority_for_logical_replication_senders = -10
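As a hedged sketch of how you might apply this from the coordinator (assuming the setting can be applied with a reload; otherwise put the line above in postgresql.conf on every node), you can combine ALTER SYSTEM with Citus' run_command_on_workers() UDF:
ALTER SYSTEM SET citus.cpu_priority_for_logical_replication_senders = -10;
SELECT run_command_on_workers(
    $$ALTER SYSTEM SET citus.cpu_priority_for_logical_replication_senders = -10$$);
SELECT pg_reload_conf();                                    -- reload the coordinator
SELECT run_command_on_workers($$SELECT pg_reload_conf()$$); -- reload the workers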
Use case 2 can with this PR be achieved by running the following as superuser. Note that this is currently only possible as superuser due to the dangers mentioned in the "Be very careful!!!" section. And although this is possible, it's NOT recommended:
ALTER USER background_job_user SET citus.cpu_priority = 5;
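For illustration only (background_job_user is just the role from the example above): new sessions for that role then start with the lowered priority, which you can verify with SHOW:
-- in a new session connected as background_job_user
SHOW citus.cpu_priority;  -- 5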
OS configuration
To actually make these settings work well it's important to run Postgres with a more permissive value for the 'nice' resource limit than Linux uses by default. By default Linux will not allow a process to set its priority value lower than it currently is, even if it was lower when the process originally started. This capability is necessary to reset the CPU priority to its original value after a transaction finishes. Depending on how you run Postgres this needs to be done in one of two ways:
If you use systemd to start Postgres all you have to do is add a line like this to the systemd service file:
LimitNICE=+0 # the + is important, otherwise it's interpreted incorrectly as 20
If that's not the case you'll have to configure /etc/security/limits.conf like so, assuming that you are running Postgres as the postgres OS user:
postgres soft nice 0
postgres hard nice 0
Finally you'd have to add the following line to /etc/pam.d/common-session:
session required pam_limits.so
These settings would allow Postgres to change the priority back to its original value after setting it to a higher value.
However, to actually allow setting priorities even lower than the default priority value, you would need to change the values in the config to something lower than 0. So for example:
LimitNICE=-10
or
postgres soft nice -10
postgres hard nice -10
If you use WSL2 you'll likely have to do one more thing: open a new shell with a forced login, because PAM limits are only applied during login, and WSL2 doesn't actually log you in. You can force a login like this:
sudo su $USER --shell /bin/bash
Source: https://stackoverflow.com/a/68322992/2570866
This behaviour is a bit strange:
$ cat /etc/security/limits.conf
...
marco soft nice 0
marco hard nice 0
$ psql
begin;
set local citus.cpu_priority = 1;
select count(*) from test;
abort;
WARNING: could not set cpu priority to 0: Permission denied
HINT: Try changing the 'nice' resource limit by changing /etc/security/limits.conf for the postgres user and/or by setting LimitNICE in the systemd service file (depending on how you start postgres).
ROLLBACK
@marcocitus I guess you're using WSL2. I added some additional steps to the description of the PR to make it work there:
If you use WSL2 you'll likely have to do one more thing: open a new shell with a forced login, because PAM limits are only applied during login, and WSL2 doesn't actually log you in. You can force a login like this:
sudo su $USER --shell /bin/bash
Source: https://stackoverflow.com/a/68322992/2570866
Thanks, that resolves the issue.
I'd caution against this - the likelihood of priority inversion issues is significant in my (admittedly not huge) experience, and can cause way worse performance issues. You end up with situations where a low priority backend holds an lwlock that high priority backends compete for etc.
Codecov Report
Merging #6126 (c6b5982) into main (1a01c89) will increase coverage by 0.01%. The diff coverage is 82.35%.
:exclamation: Current head c6b5982 differs from pull request most recent head 5dab984. Consider uploading reports for the commit 5dab984 to get more accurate results.
@@ Coverage Diff @@
## main #6126 +/- ##
==========================================
+ Coverage 92.92% 92.94% +0.01%
==========================================
Files 252 253 +1
Lines 52892 52910 +18
==========================================
+ Hits 49152 49176 +24
+ Misses 3740 3734 -6
@anarazel I tried to reduce the bad effects of priority inversion a lot with a few changes.
- We only change cpu priorities for logical replication senders that are used for shard moves/splits.
- Changing citus.cpu_priority is only allowed for superuser, so regular users cannot cause priority inversion.
- We use a GUC to put a hard limit on the number of logical replication senders that have their priority increased, because shard splits can create a lot of logical replication senders.
So, the idea is that this CPU priority feature is only used to make sure that logical replication senders can keep up with the changes. All other backends keep their regular CPU priority. This should reduce the bad effects of priority inversion because, in the worst case, the logical replication backends are lowered back to the normal priority that they would have had anyway. Vacuum and checkpointer processes should be able to get enough resources, because there's a hard limit on the number of processes with increased priorities.
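As a rough usage sketch under those constraints (the shard id, node names, and port below are placeholders): with citus.cpu_priority_for_logical_replication_senders configured as described in the PR description, a logical-replication based shard move such as the one below has its WAL senders run at the raised priority, while regular client backends stay at priority 0:
-- placeholders: use a real shard id and your actual worker node names/ports
SELECT citus_move_shard_placement(
    102008,
    'worker-1', 5432,
    'worker-2', 5432,
    shard_transfer_mode => 'force_logical');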
Where is the benchmark proving this is even really solving anything?
/me waits for this to blow up in our faces a year or three down the line.
Where is the benchmark proving this is even really solving anything?
Both @onderkalaci and I did benchmarks where we did shard moves with some load on the cluster (I was running HammerDB). When we manually decreased the niceness of the logical replication walsender process the shard moves completed much quicker; for me, going from 0 niceness to -10 niceness gave a 50% speed boost. Average CPU usage was also kinda low (~10-30% of one core), so I really doubt this will cause issues other than slowing the regular traffic down a little bit.
/me waits for this to blow up in our faces a year or three down the line.
to be fair, catchup taking forever due to starvation blows up in our faces pretty regularly.