Hung (unclosed) old connections in idle state on the monitor from data nodes
Good afternoon.
pg_auto_failover version: 2.0
On several pg_auto_failover clusters we found hung connections in the idle state on the monitor server (apparently never closed by the data nodes):
datid pid usename application_name client_addr backend_start state_change "now() - backend_start" state wait_event_type wait_event query
16521 272257 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-10 18:36:34.894 +0300 2024-01-10 18:36:36.296 +0300 8 days 14:36:37.430901 idle Client ClientRead
16521 1195688 autoctl_node pgautofailover_standby_3 10.9.10.13 2024-01-11 17:31:07.396 +0300 2024-01-11 17:31:09.309 +0300 7 days 15:42:04.928758 idle Client ClientRead
16521 1714307 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-12 06:23:28.683 +0300 2024-01-12 06:23:30.095 +0300 7 days 02:49:43.641432 idle Client ClientRead
16521 2335845 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-12 21:51:32.808 +0300 2024-01-12 21:51:34.295 +0300 6 days 11:21:39.51676 idle Client ClientRead
16521 2882026 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-13 11:24:12.232 +0300 2024-01-13 11:24:14.003 +0300 5 days 21:49:00.093055 idle Client ClientRead
16521 3127812 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-13 17:29:09.876 +0300 2024-01-13 17:29:11.251 +0300 5 days 15:44:02.448422 idle Client ClientRead
16521 1580029 autoctl_node pgautofailover_standby_3 10.9.10.13 2024-01-16 11:13:23.745 +0300 2024-01-16 11:13:25.066 +0300 2 days 21:59:48.579498 idle Client ClientRead
16521 1938664 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-16 20:10:02.488 +0300 2024-01-16 20:10:04.085 +0300 2 days 13:03:09.836981 idle Client ClientRead
16521 2140541 autoctl_node pgautofailover_standby_3 10.9.10.13 2024-01-17 01:11:00.742 +0300 2024-01-17 01:11:02.403 +0300 2 days 08:02:11.58282 idle Client ClientRead
16521 2275548 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-17 04:31:38.530 +0300 2024-01-17 04:31:40.203 +0300 2 days 04:41:33.794781 idle Client ClientRead
16521 3986569 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-18 22:55:32.673 +0300 2024-01-18 22:55:34.221 +0300 10:17:39.651776 idle Client ClientRead
16521 4040617 autoctl_node pgautofailover_standby_3 10.9.10.13 2024-01-19 00:15:48.534 +0300 2024-01-19 00:15:50.478 +0300 08:57:23.790725 idle Client ClientRead
16521 209392 autoctl_node pgautofailover_standby_3 10.9.10.13 2024-01-19 09:13:11.821 +0300 2024-01-19 09:13:11.827 +0300 00:00:00.50327 idle Client ClientRead LISTEN "state"
16521 209398 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-19 09:13:12.223 +0300 2024-01-19 09:13:12.228 +0300 00:00:00.101521 idle Client ClientRead LISTEN "state"
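The listing above comes from pg_stat_activity on the monitor; a query along these lines reproduces it (the exact query we ran is not shown here, so the column list below is an assumption based on the output):

select datid, pid, usename, application_name, client_addr,
       backend_start, state_change,
       now() - backend_start as "now() - backend_start",
       state, wait_event_type, wait_event, query
from pg_stat_activity
where usename = 'autoctl_node'
order by backend_start;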
In this example from one of the servers, the two most recent connections are the ones actually doing work and will be closed, but the rest just hang (some for several days) and are never closed. On one of the monitor servers, 100+ such connections accumulated over its uptime.
I understand that these are connections from the data nodes, made as part of the monitoring protocol https://pg-auto-failover.readthedocs.io/en/main/architecture.html#monitoring-protocol :
Data nodes periodically connect and run SELECT pgautofailover.node_active(…) to communicate their current state and obtain their goal state.
But for some reason not all of these connections get closed, and they remain stuck in the idle state. Possibly the data nodes do not always close the connection during operation.
Update: we sometimes see short network lags on these clusters; judging by the pg_auto_failover logs, the data nodes cannot always reach the monitor. This is what the data node log shows at the moment it created a connection that ended up hanging:
Jan 22 00:50:17 server02.domain pg_autoctl[3583]: 00:50:17 3583 WARN Failed to connect to "postgres://[email protected]:5432/pg_auto_failover?password=****", retrying until the server is ready
Jan 22 00:50:19 server02.domain pg_autoctl[3583]: 00:50:19 3583 INFO Successfully connected to "postgres://[email protected]:5432/pg_auto_failover?password=****" after 2 attempts in 3463 ms.
There are no other logs on the servers from either pg_auto_failover or postgresql.
Judging by state_change in pg_stat_activity, the hung connection appeared at 00:50:17, which matches the failed attempt by the data node to connect to the monitor (according to its log). The data node probably did establish the connection after all, but the network interrupted it at that moment.
We tried setting the client_connection_check_interval and idle_session_timeout parameters, but that had no effect. Since neither parameter helped, we assume the connection is in the idle state but is still being held open by the client. So we restarted pg_auto_failover on the data node that had created the connection, and the connection on the monitor was closed immediately. In other words, pg_auto_failover on the data node kept holding this connection for some reason and never closed it, even though it kept opening new connections to exchange information with the monitor (once a second, judging by the documentation), and the cluster itself worked correctly.
So there seems to be some problem that causes the data nodes to keep holding such connections open on the monitor even though no activity ever happens on them.
As a temporary solution, we set up a pg_cron job on the monitor that periodically kills these hung connections from the data nodes:
select pg_terminate_backend(pid) from pg_stat_activity where usename = 'autoctl_node' and datname = 'pg_auto_failover' and state = 'idle' and (clock_timestamp() - state_change) > '03:00:00';
This solved the problem for us.
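For reference, with the pg_cron extension installed on the monitor, that cleanup can be scheduled roughly like this (the job name and the 30-minute schedule below are illustrative, not our exact settings):

select cron.schedule(
  'terminate_hung_autoctl_connections',
  '*/30 * * * *',
  $$select pg_terminate_backend(pid)
      from pg_stat_activity
     where usename = 'autoctl_node'
       and datname = 'pg_auto_failover'
       and state = 'idle'
       and (clock_timestamp() - state_change) > '03:00:00'$$
);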
Of course, such connections can simply be killed by any scheduled job, but I would very much like to hear the developers' opinion. It looks like, in case of network problems, connections accumulate in some internal pool...
Maybe instead of pg_cron, configure idle_session_timeout to some value like a few hours or days.
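For example (the value here is only illustrative), either server-wide or scoped to the monitor connections only:

alter system set idle_session_timeout = '12h';
select pg_reload_conf();
-- or only for the connections made by the data nodes:
alter role autoctl_node set idle_session_timeout = '12h';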
Hello.
As I wrote above:
We tried setting the client_connection_check_interval and idle_session_timeout parameters, but that had no effect. Since neither parameter helped, we assume the connection is in the idle state but is still being held open by the client. So we restarted pg_auto_failover on the data node that had created the connection, and the connection on the monitor was closed immediately. In other words, pg_auto_failover on the data node kept holding this connection for some reason and never closed it, even though it kept opening new connections to exchange information with the monitor (once a second, judging by the documentation), and the cluster itself worked correctly.
We tried - it didn't help, apparently pg_auto_failover itself holds the connection.
@xinferum what is your PostgreSQL version?
PostgreSQL version 15.4.
I also want to clarify that this behavior is observed only on one of our projects; it runs several pg_auto_failover clusters, and such connections appear on all of them.
There are no such problems on other projects with a similar cluster configuration, but they run a different operating system.