Memory leak in pg_auto_failover v2.0 with PostgreSQL v15
Hello,
we are experiencing a memory leak with the combination of pg_auto_failover v2.0 and PostgreSQL v15.
The "pg_autoctl: node active" process gradually consumes all of the server's memory, until the OOM killer steps in and restarts the pg_autoctl service, which resets the memory consumption.
Below is a listing of the processes on our test server, where we created a new PostgreSQL 15 HA cluster on Friday, August 4, 2023; the node-active process has already grown to about 2 GB. There is no traffic on the database, only the communication between the primary, standby and witness servers.
postgres 5398 0.0 0.0 79784 244 ? Ss Aug04 2:14 /usr/pgsql-15/bin/pg_autoctl run
postgres 5414 0.0 0.0 79788 84 ? S Aug04 1:23 \_ pg_autoctl: start/stop postgres
postgres 5515 0.1 0.0 2428240 2004 ? S Aug04 4:36 | \_ /usr/pgsql-15/bin/postgres -D /pgdata/memoryleak/data -p 6011 -h *
postgres 5542 0.0 0.0 255440 164 ? Ss Aug04 0:18 | \_ postgres: memoryleak: logger
postgres 5544 0.0 0.0 2428252 452 ? Ss Aug04 0:00 | \_ postgres: memoryleak: checkpointer
postgres 5545 0.0 0.0 2428244 340 ? Ss Aug04 0:01 | \_ postgres: memoryleak: background writer
postgres 5547 0.0 0.0 2428244 192 ? Ss Aug04 0:01 | \_ postgres: memoryleak: walwriter
postgres 5548 0.0 0.0 2429808 844 ? Ss Aug04 0:01 | \_ postgres: memoryleak: autovacuum launcher
postgres 5549 0.0 0.0 2428244 228 ? Ss Aug04 0:00 | \_ postgres: memoryleak: archiver last was 000000010000000000000009
postgres 5550 0.0 0.0 2429700 544 ? Ss Aug04 0:00 | \_ postgres: memoryleak: logical replication launcher
postgres 6082 0.0 0.0 2432548 792 ? Ss Aug04 0:03 | \_ postgres: memoryleak: walsender pgautofailover_replicator 192.168.202.39(57316) streaming 0/A000000
postgres 5415 2.3 6.7 2304272 2225292 ? S Aug04 104:03 \_ pg_autoctl: node active
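For reference, a simple loop like the following can be used to watch the resident size of that process grow over time (the one-minute interval and the log file name are arbitrary choices; 5415 is the PID of the node-active process on this server):
# append PID, RSS (KB), elapsed time and command line once a minute
while true; do
    ps -o pid,rss,etime,cmd -p 5415 --no-headers >> /tmp/node_active_rss.log
    sleep 60
done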
OS:
Linux smbdb6a.dmz.skoda.vwg 3.10.0-1160.81.1.el7.x86_64 #1 SMP Thu Nov 24 12:21:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.9 (Maipo)
PostgreSQL:
postgres=# select version();
version
---------------------------------------------------------------------------------------------------------
PostgreSQL 15.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
(1 row)
pg_auto_failover:
/usr/pgsql-15/bin/pg_autoctl version
pg_autoctl version 2.0
pg_autoctl extension version 2.0
compiled with PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
compatible with Postgres 10, 11, 12, 13, and 14
We can provide more information if you need it.
We do not see this behavior with other PostgreSQL versions (10, 11, 13); there the node-active process stays at about 80 MB of memory even after several months of running:
postgres 961 2.0 0.0 82160 3120 ? S May24 2175:13 pg_autoctl: node active
postgres 1276 2.3 0.0 82588 3120 ? S Apr06 4229:55 pg_autoctl: node active
postgres 2484 2.4 0.0 82588 3152 ? S May24 2644:25 pg_autoctl: node active
postgres 4241 2.0 0.0 82592 3124 ? S Jul13 725:24 pg_autoctl: node active
postgres 5415 2.3 7.0 2382372 2303376 ? S Aug04 107:38 pg_autoctl: node active
postgres 13853 2.7 0.3 190336 114772 ? S 10:45 5:03 pg_autoctl: node active
postgres 16450 2.0 0.0 82164 3096 ? S May30 2012:36 pg_autoctl: node active
postgres 27700 1.9 0.0 82608 3088 ? S Feb23 4714:01 pg_autoctl: node active
PIDs 5415 and 13853 are the PostgreSQL 15 nodes.
Do you have any idea what we can do about this, please? Is PostgreSQL 15 supported?
Hello,
we have tested this bug on other servers and unfortunately it also appears with PostgreSQL version 14.
For now it looks to us like a combination with the operating system version, specifically RHEL 7: on RHEL 8, the 'pg_autoctl: node active' process stays at about 70 MB of memory.
Does anyone have a solution for this, please?
Hello @MiroslavDanek,
I believe your bug requires careful investigation, and time. Having access to a test server or to some gdb output would definitely help.
One way to investigate is to run make valgrind-session, then do a couple of interactive failovers/switchovers, and then look at the valgrind reports.
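Roughly, that workflow looks like this (assuming a source checkout of pg_auto_failover; the switchover command and the report file names are illustrative, valgrind writes its logs wherever the session target configures them):
# start a local test cluster with the pg_autoctl processes running under valgrind
make valgrind-session

# in another terminal, trigger a few orchestrated switchovers so that the
# node-active loop does some real work
pg_autoctl perform switchover --formation default

# then look for leak summaries in the valgrind output
grep -n "definitely lost" valgrind*.log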
We didn't observe this leak on Ubuntu 20.04 or Ubuntu 22.04 over the past 3 years, with multiple clusters running.
@MiroslavDanek Have you checked whether you have any HardwareCorrupted bytes of memory on your systems? Are there any CE/UE ECC errors logged? I have seen processes leak memory in the past because of such corruption.
Since only the older systems (RHEL 7) seem to be affected, it is plausible that there is a hardware failure involved.
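For example, something along these lines (the EDAC sysfs counters depend on the platform and may be absent, e.g. inside VMs):
# pages the kernel has marked as corrupted
grep HardwareCorrupted /proc/meminfo

# corrected / uncorrectable ECC error counters exposed by the EDAC driver
grep . /sys/devices/system/edac/mc/mc*/ce_count \
       /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null

# kernel log entries about memory / machine-check errors
dmesg | grep -iE 'edac|mce|hardware error'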