pg_auto_failover memory leak in version 2.0

Hello,

We are experiencing a memory leak with the combination of pg_auto_failover v2.0 and PostgreSQL v15.

The 'pg_autoctl: node active' process gradually consumes all the memory on the server. Eventually the OOM killer intervenes, the pg_autoctl service restarts, and memory consumption is reset.

Attached is a process listing from our test server, where we created a new PostgreSQL 15 HA cluster. The cluster was created on Friday, August 4, 2023, and the process has already used up about 2GB. There is no traffic on the database, only communication between the primary, standby, and witness servers.

postgres  5398  0.0  0.0  79784   244 ?        Ss   Aug04   2:14 /usr/pgsql-15/bin/pg_autoctl run
postgres  5414  0.0  0.0  79788    84 ?        S    Aug04   1:23  \_ pg_autoctl: start/stop postgres
postgres  5515  0.1  0.0 2428240 2004 ?        S    Aug04   4:36  |   \_ /usr/pgsql-15/bin/postgres -D /pgdata/memoryleak/data -p 6011 -h *
postgres  5542  0.0  0.0 255440   164 ?        Ss   Aug04   0:18  |       \_ postgres: memoryleak: logger
postgres  5544  0.0  0.0 2428252  452 ?        Ss   Aug04   0:00  |       \_ postgres: memoryleak: checkpointer
postgres  5545  0.0  0.0 2428244  340 ?        Ss   Aug04   0:01  |       \_ postgres: memoryleak: background writer
postgres  5547  0.0  0.0 2428244  192 ?        Ss   Aug04   0:01  |       \_ postgres: memoryleak: walwriter
postgres  5548  0.0  0.0 2429808  844 ?        Ss   Aug04   0:01  |       \_ postgres: memoryleak: autovacuum launcher
postgres  5549  0.0  0.0 2428244  228 ?        Ss   Aug04   0:00  |       \_ postgres: memoryleak: archiver last was 000000010000000000000009
postgres  5550  0.0  0.0 2429700  544 ?        Ss   Aug04   0:00  |       \_ postgres: memoryleak: logical replication launcher
postgres  6082  0.0  0.0 2432548  792 ?        Ss   Aug04   0:03  |       \_ postgres: memoryleak: walsender pgautofailover_replicator 192.168.202.39(57316) streaming 0/A000000
postgres  5415  2.3  6.7 2304272 2225292 ?     S    Aug04 104:03  \_ pg_autoctl: node active
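To show that the growth is steady rather than a one-time allocation, the resident set size of the process can be sampled over time. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/status exposes a VmRSS line; the function names are hypothetical helpers and not part of pg_auto_failover.

```python
import re
import time
from pathlib import Path


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def growth_kb_per_hour(samples):
    """Estimate average RSS growth from a list of (unix_time, rss_kb) samples."""
    if len(samples) < 2:
        return 0.0
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    elapsed_hours = (t1 - t0) / 3600.0
    return (rss1 - rss0) / elapsed_hours if elapsed_hours > 0 else 0.0


def sample_rss(pid, count=3, interval_s=60):
    """Read VmRSS for `pid` `count` times, `interval_s` seconds apart."""
    samples = []
    for i in range(count):
        rss_kb = parse_vmrss_kb(Path(f"/proc/{pid}/status").read_text())
        if rss_kb is not None:  # skip reads without a VmRSS line
            samples.append((time.time(), rss_kb))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

For example, `growth_kb_per_hour(sample_rss(5415))` would give a rough growth rate for the 'pg_autoctl: node active' process from the listing above.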

OS: 
Linux smbdb6a.dmz.skoda.vwg 3.10.0-1160.81.1.el7.x86_64 #1 SMP Thu Nov 24 12:21:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.9 (Maipo)

PostgreSQL: 
postgres=# select version();
                                                 version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 15.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
(1 row)

pg_auto_failover:
/usr/pgsql-15/bin/pg_autoctl version
pg_autoctl version 2.0
pg_autoctl extension version 2.0
compiled with PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
compatible with Postgres 10, 11, 12, 13, and 14

We can provide you with more information if you need it.

We do not encounter this behavior with other versions of PostgreSQL (10, 11, 13). There, the process takes about 80MB of memory, even after several months of running:

postgres   961  2.0  0.0  82160  3120 ?        S    May24 2175:13 pg_autoctl: node active
postgres  1276  2.3  0.0  82588  3120 ?        S    Apr06 4229:55 pg_autoctl: node active
postgres  2484  2.4  0.0  82588  3152 ?        S    May24 2644:25 pg_autoctl: node active
postgres  4241  2.0  0.0  82592  3124 ?        S    Jul13 725:24 pg_autoctl: node active
postgres  5415  2.3  7.0 2382372 2303376 ?     S    Aug04 107:38 pg_autoctl: node active
postgres 13853  2.7  0.3 190336 114772 ?       S    10:45   5:03 pg_autoctl: node active
postgres 16450  2.0  0.0  82164  3096 ?        S    May30 2012:36 pg_autoctl: node active
postgres 27700  1.9  0.0  82608  3088 ?        S    Feb23 4714:01 pg_autoctl: node active

5415, 13853 = PostgreSQL 15
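The comparison in the listing above can be automated across servers. This is a small sketch, assuming input lines in the `ps aux` column layout shown above (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND); the 100000 kB threshold and the function names are illustrative assumptions.

```python
def parse_ps_line(line):
    """Split one `ps aux`-style line into (pid, vsz_kb, rss_kb, command)."""
    fields = line.split(None, 10)  # 11 columns; the command keeps its spaces
    return int(fields[1]), int(fields[4]), int(fields[5]), fields[10]


def flag_large_rss(lines, threshold_kb=100_000, name="pg_autoctl: node active"):
    """Return (pid, rss_kb) for matching processes whose RSS exceeds the threshold."""
    flagged = []
    for line in lines:
        if name not in line or len(line.split(None, 10)) < 11:
            continue  # not the process we care about, or not a full ps line
        pid, _vsz_kb, rss_kb, _command = parse_ps_line(line)
        if rss_kb > threshold_kb:
            flagged.append((pid, rss_kb))
    return flagged
```

Fed the eight lines above, this flags only the two PostgreSQL 15 processes, since every other process has an RSS of roughly 3MB.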

Do you have any idea what we can do about this? Is PostgreSQL 15 supported?

Aug 07 '23 11:08 MiroslavDanek

Hello,

We have tried to reproduce this bug on different servers, and unfortunately it also appears with PostgreSQL version 14.

For now, it seems to be tied to the operating system version, specifically RHEL 7. On RHEL 8, the 'pg_autoctl: node active' process takes about 70MB of memory.

Does anyone have a solution for this, please?

Sep 13 '23 08:09 MiroslavDanek

Hello @MiroslavDanek

I believe your bug requires careful investigation and time. Having access to a test server or gdb output would definitely help...

Sep 25 '23 09:09 c2main

One way to investigate is to use make valgrind-session, perform a couple of interactive failovers/switchovers, and then look at the valgrind reports.

Sep 25 '23 18:09 dimitri

We didn't observe this leak on Ubuntu 20.04 and Ubuntu 22.04 over the past 3 years with multiple clusters running. @MiroslavDanek

Have you checked whether your systems report any HardwareCorrupted memory? Are there any CE/UE ECC errors logged? I've had some processes experiencing memory leaks in the past due to corruption.

Since only the older systems (RHEL7) seem to experience this, a hardware failure there is plausible.
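A quick way to check the first point is to read the HardwareCorrupted line from /proc/meminfo. The sketch below assumes a Linux kernel that exposes that field (it is only present when memory-failure handling is built in); the helper names are hypothetical.

```python
import re
from pathlib import Path


def hardware_corrupted_kb(meminfo_text):
    """Return the HardwareCorrupted value in kB, or None if the field is absent."""
    match = re.search(
        r"^HardwareCorrupted:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def read_hardware_corrupted_kb(path="/proc/meminfo"):
    """Read the value from the running system's /proc/meminfo."""
    return hardware_corrupted_kb(Path(path).read_text())
```

A non-zero result from `read_hardware_corrupted_kb()` would suggest the kernel has isolated bad pages; zero or None does not rule out other hardware problems.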

Dec 06 '23 12:12 Akkowicz