
Support changing CPU priorities for backends and shard moves

Open JelteF opened this issue 2 years ago • 4 comments

DESCRIPTION: Support changing CPU priorities for backends and shard moves

Intro

This adds support to Citus for changing the CPU priority values of backends. It was created with two main use cases in mind:

  1. Users might want to run the logical replication part of shard moves or shard splits at a higher priority than it would get by default. This might cost some DB performance for their regular queries, but it is often worth it. Under high load it's very possible that the logical replication WAL sender cannot keep up with the WAL that is generated. This is an especially big problem when the machine is close to running out of disk during a rebalance.
  2. Users might have certain long-running queries that they want to run at a lower priority, so that those queries don't impact their regular workload too much.

Be very careful!!!

Using CPU priorities to control scheduling can be helpful in some cases to control which processes get more CPU time than others. However, due to an issue called "priority inversion", combining CPU priorities with the many locks used within Postgres can cause the exact opposite of the behavior you intended. This is why this PR only allows the PG superuser to change the CPU priority of its own processes. Setting citus.cpu_priority directly is currently not recommended. The only recommended interface for users right now is the setting called citus.cpu_priority_for_logical_replication_senders. This setting controls the CPU priority of a very limited set of processes (the logical replication senders), so the dangers of priority inversion are also limited when using it for this use case.

Background

Before reading the rest it's important to understand some basics about process CPU priorities, because they are a bit counterintuitive. A lower priority value means that the process will be scheduled more, so whatever it's doing will complete faster. The default priority for processes is 0. Valid values range from -20 to 19 inclusive. On Linux, a larger difference between the values of two processes results in a bigger difference in the share of CPU time they get.

Handling the use cases

Use case 1 can be achieved by setting citus.cpu_priority_for_logical_replication_senders to the priority value you want these senders to have. It's necessary to set this on both the workers and the coordinator. Example:

citus.cpu_priority_for_logical_replication_senders = -10
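
Since the setting has to be in place on every node, one way to roll it out from the coordinator is ALTER SYSTEM combined with Citus' run_command_on_workers(). This is only a sketch, assuming the setting can be picked up with a config reload rather than a restart:

-- run as superuser on the coordinator
ALTER SYSTEM SET citus.cpu_priority_for_logical_replication_senders = -10;
SELECT run_command_on_workers(
  $$ALTER SYSTEM SET citus.cpu_priority_for_logical_replication_senders = -10$$);
SELECT pg_reload_conf();
SELECT run_command_on_workers($$SELECT pg_reload_conf()$$);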

Use case 2 can, with this PR, be achieved by running the following as superuser. Note that this is currently only possible as superuser due to the dangers mentioned in the "Be very careful!!!" section. And although this is possible, it's NOT recommended:

ALTER USER background_job_user SET citus.cpu_priority = 5;
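
For a one-off long-running query, a superuser could also scope the setting to a single transaction with SET LOCAL instead of changing it per role; the priority is reset when the transaction ends. A minimal sketch (the table name is just for illustration):

BEGIN;
SET LOCAL citus.cpu_priority = 5;     -- niceness 5: this backend gets less CPU time
SELECT count(*) FROM some_big_table;  -- hypothetical long-running query
COMMIT;                               -- the backend's original priority is restored afterwards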

OS configuration

To actually make these settings work well, it's important to run Postgres with a more permissive value for the 'nice' resource limit than Linux uses by default. By default Linux will not allow a process to set its priority value lower than it currently is, even if it was lower when the process originally started. That capability is needed to reset the CPU priority to its original value after a transaction finishes. Depending on how you run Postgres this needs to be done in one of two ways:

If you use systemd to start Postgres all you have to do is add a line like this to the systemd service file:

LimitNICE=+0 # the + is important, otherwise it's interpreted incorrectly as 20

If that's not the case you'll have to configure /etc/security/limits.conf like so, assuming that you are running Postgres as the postgres OS user:

postgres            soft    nice            0
postgres            hard    nice            0

Finally you'd have to add the following line to /etc/pam.d/common-session:

session required pam_limits.so

These settings allow Postgres to change the priority back to the default of 0 after setting it to a higher value.

However, to allow setting priority values even lower than the default of 0, you need to change the values in the config to something lower than 0 as well. For example:

LimitNICE=-10

or

postgres            soft    nice            -10
postgres            hard    nice            -10
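
Once the limits are in place, a quick smoke test can be run from psql as superuser. This is only a sketch: if the limit did not take effect, resetting the priority at the end of the transaction fails with the "could not set cpu priority" warning shown further down in this thread.

BEGIN;
SET LOCAL citus.cpu_priority = 5;  -- raising the nice value is always allowed
SELECT 1;
ROLLBACK;  -- resetting back to the original priority is where a too-strict limit shows up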

If you use WSL2 you'll likely have to do one more thing. You have to open a new shell, because PAM limits are only applied during login, and WSL2 doesn't actually log you in. You can force a login like this:

sudo su $USER --shell /bin/bash

Source: https://stackoverflow.com/a/68322992/2570866

JelteF avatar Aug 03 '22 14:08 JelteF

This behaviour is a bit strange:

$ cat /etc/security/limits.conf 
...
marco soft    nice            0
marco hard    nice            0
$ psql
begin;
set local citus.cpu_priority = 1;
select count(*) from test;
abort;
WARNING:  could not set cpu priority to 0: Permission denied
HINT:  Try changing the 'nice' resource limit by changing /etc/security/limits.conf for the postgres user and/or by setting LimitNICE in your the systemd service file (depending on how you start postgres).
ROLLBACK

marcocitus avatar Aug 04 '22 12:08 marcocitus

@marcocitus I guess you're using WSL2. I added some additional steps to the description of the PR to make it work there:

If you use WSL2 you'll likely have to do one more thing. You have to open a new shell, because PAM limits are only applied during login, and WSL2 doesn't actually log you in. You can force a login like this:

sudo su $USER --shell /bin/bash

Source: https://stackoverflow.com/a/68322992/2570866

JelteF avatar Aug 04 '22 12:08 JelteF

Thanks, that resolves the issue.

marcocitus avatar Aug 04 '22 12:08 marcocitus

I'd caution against this - the likelihood of priority inversion issues is significant in my (admittedly not huge) experience, and can cause way worse performance issues. You end up with situations where a low priority backend holds an lwlock that high priority backends compete for etc.

anarazel avatar Aug 04 '22 14:08 anarazel

Codecov Report

Merging #6126 (c6b5982) into main (1a01c89) will increase coverage by 0.01%. The diff coverage is 82.35%.

:exclamation: Current head c6b5982 differs from pull request most recent head 5dab984. Consider uploading reports for the commit 5dab984 to get more accurate results

@@            Coverage Diff             @@
##             main    #6126      +/-   ##
==========================================
+ Coverage   92.92%   92.94%   +0.01%     
==========================================
  Files         252      253       +1     
  Lines       52892    52910      +18     
==========================================
+ Hits        49152    49176      +24     
+ Misses       3740     3734       -6     

codecov[bot] avatar Aug 16 '22 10:08 codecov[bot]

@anarazel I tried to reduce the bad effects of priority inversion a lot with a few changes.

  1. We only change cpu priorities for logical replication senders that are used for shard moves/splits.
  2. Changing citus.cpu_priority is only allowed for superuser, so regular users cannot cause priority inversion.
  3. We use a GUC to put a hard limit on the number of logical replication senders that have their priority increased, because shard splits can create a lot of logical replication senders.

So, the idea is that this CPU priority feature is only used to make sure that logical replication senders can keep up with the changes. All other backends keep their regular CPU priority. This should reduce the bad effects of priority inversion because, worst case, the logical replication backends will be lowered to the normal priority they would have had anyway. Vacuum and checkpointer processes should be able to get enough resources, because there's a hard limit on the number of processes with increased priorities.

JelteF avatar Aug 16 '22 10:08 JelteF

Where is the benchmark proving this is even really solving anything?

/me waits for this to blow up in our faces a year or three down the line.

anarazel avatar Aug 16 '22 15:08 anarazel

Where is the benchmark proving this is even really solving anything?

Both @onderkalaci and I did benchmarks where we did shard moves with some load on the cluster (I was running HammerDB). When we manually decreased the niceness of the logical replication walsender process, the shard moves completed much quicker; for me, going from 0 niceness to -10 niceness gave a 50% speed boost. Average CPU usage was also fairly low, ~10-30% of one core, so I really doubt this will cause issues other than slowing the regular traffic down a little bit.

JelteF avatar Aug 16 '22 16:08 JelteF

/me waits for this to blow up in our faces a year or three down the line.

to be fair, catchup taking forever due to starvation blows up in our faces pretty regularly.

marcocitus avatar Aug 16 '22 16:08 marcocitus