Hung (unclosed) old connections in idle state on the monitor from data nodes
Good afternoon.
pg_auto_failover version: 2.0
On several pg_auto_failover clusters we found hung connections in the idle state on the monitor server (apparently never closed by the data nodes):
datid pid usename application_name client_addr backend_start state_change "now() - backend_start" state wait_event_type wait_event query
16521 272257 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-10 18:36:34.894 +0300 2024-01-10 18:36:36.296 +0300 8 days 14:36:37.430901 idle Client ClientRead
16521 1195688 autoctl_node pgautofailover_standby_3 10.9.10.13 2024-01-11 17:31:07.396 +0300 2024-01-11 17:31:09.309 +0300 7 days 15:42:04.928758 idle Client ClientRead
16521 1714307 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-12 06:23:28.683 +0300 2024-01-12 06:23:30.095 +0300 7 days 02:49:43.641432 idle Client ClientRead
16521 2335845 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-12 21:51:32.808 +0300 2024-01-12 21:51:34.295 +0300 6 days 11:21:39.51676 idle Client ClientRead
16521 2882026 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-13 11:24:12.232 +0300 2024-01-13 11:24:14.003 +0300 5 days 21:49:00.093055 idle Client ClientRead
16521 3127812 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-13 17:29:09.876 +0300 2024-01-13 17:29:11.251 +0300 5 days 15:44:02.448422 idle Client ClientRead
16521 1580029 autoctl_node pgautofailover_standby_3 10.9.10.13 2024-01-16 11:13:23.745 +0300 2024-01-16 11:13:25.066 +0300 2 days 21:59:48.579498 idle Client ClientRead
16521 1938664 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-16 20:10:02.488 +0300 2024-01-16 20:10:04.085 +0300 2 days 13:03:09.836981 idle Client ClientRead
16521 2140541 autoctl_node pgautofailover_standby_3 10.9.10.13 2024-01-17 01:11:00.742 +0300 2024-01-17 01:11:02.403 +0300 2 days 08:02:11.58282 idle Client ClientRead
16521 2275548 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-17 04:31:38.530 +0300 2024-01-17 04:31:40.203 +0300 2 days 04:41:33.794781 idle Client ClientRead
16521 3986569 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-18 22:55:32.673 +0300 2024-01-18 22:55:34.221 +0300 10:17:39.651776 idle Client ClientRead
16521 4040617 autoctl_node pgautofailover_standby_3 10.9.10.13 2024-01-19 00:15:48.534 +0300 2024-01-19 00:15:50.478 +0300 08:57:23.790725 idle Client ClientRead
16521 209392 autoctl_node pgautofailover_standby_3 10.9.10.13 2024-01-19 09:13:11.821 +0300 2024-01-19 09:13:11.827 +0300 00:00:00.50327 idle Client ClientRead LISTEN "state"
16521 209398 autoctl_node pgautofailover_standby_1 10.9.10.11 2024-01-19 09:13:12.223 +0300 2024-01-19 09:13:12.228 +0300 00:00:00.101521 idle Client ClientRead LISTEN "state"
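The listing above comes from pg_stat_activity on the monitor; a query along these lines reproduces it (the exact query we ran is not shown here, so the column list below is an assumption based on the output):

select datid, pid, usename, application_name, client_addr,
       backend_start, state_change,
       now() - backend_start as "now() - backend_start",
       state, wait_event_type, wait_event, query
from pg_stat_activity
where usename = 'autoctl_node'
order by backend_start;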
In this example from one of the servers, the two most recent connections are the ones actually doing work and will be closed, but the rest just hang (some for several days) and are never closed. On one of the monitor servers, 100+ such connections accumulated over its uptime.
I understand that these are connections from the data nodes, made as part of the monitoring protocol https://pg-auto-failover.readthedocs.io/en/main/architecture.html#monitoring-protocol :
Data nodes periodically connect and run SELECT pgautofailover.node_active(…) to communicate their current state and obtain their goal state.
But for some reason not all of these connections get closed, and they remain stuck in the idle state. Possibly the data nodes do not always close the connection during operation.
Update: we sometimes see short network lags on these clusters; judging by the pg_auto_failover logs, the data nodes cannot always reach the monitor. This is what the data node log shows at the moment it created a connection that ended up hanging:
Jan 22 00:50:17 server02.domain pg_autoctl[3583]: 00:50:17 3583 WARN Failed to connect to "postgres://[email protected]:5432/pg_auto_failover?password=****", retrying until the server is ready
Jan 22 00:50:19 server02.domain pg_autoctl[3583]: 00:50:19 3583 INFO Successfully connected to "postgres://[email protected]:5432/pg_auto_failover?password=****" after 2 attempts in 3463 ms.
There are no other logs on the servers from either pg_auto_failover or postgresql.
Judging by state_change in pg_stat_activity, the hung connection appeared at 00:50:17, which matches the failed attempt by the data node to connect to the monitor (according to its log). The data node probably did establish the connection after all, but the network interrupted it at that moment.
We tried setting the client_connection_check_interval and idle_session_timeout parameters, but that had no effect. Since neither parameter helped, we assume the connection is in the idle state but is still being held open by the client. So we restarted pg_auto_failover on the data node that had created the connection, and the connection on the monitor was closed immediately. In other words, pg_auto_failover on the data node kept holding this connection for some reason and never closed it, even though it kept opening new connections to exchange information with the monitor (once a second, judging by the documentation), and the cluster itself worked correctly.
So there seems to be some problem that causes the data nodes to keep holding such connections open on the monitor even though no activity ever happens on them.
As a temporary solution, we set up a pg_cron job on the monitor that periodically kills these hung connections from the data nodes:
select pg_terminate_backend(pid) from pg_stat_activity where usename = 'autoctl_node' and datname = 'pg_auto_failover' and state = 'idle' and (clock_timestamp() - state_change) > '03:00:00';
This solved the problem for us.
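For reference, with the pg_cron extension installed on the monitor, that cleanup can be scheduled roughly like this (the job name and the 30-minute schedule below are illustrative, not our exact settings):

select cron.schedule(
  'terminate_hung_autoctl_connections',
  '*/30 * * * *',
  $$select pg_terminate_backend(pid)
      from pg_stat_activity
     where usename = 'autoctl_node'
       and datname = 'pg_auto_failover'
       and state = 'idle'
       and (clock_timestamp() - state_change) > '03:00:00'$$
);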
Of course, such connections can simply be killed by any scheduled job, but I would very much like to hear the developers' opinion. It looks like, in case of network problems, connections accumulate in some internal pool...
Maybe instead of pg_cron, configure idle_session_timeout to some value like a few hours or days.
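For example (the value here is only illustrative), either server-wide or scoped to the monitor connections only:

alter system set idle_session_timeout = '12h';
select pg_reload_conf();
-- or only for the connections made by the data nodes:
alter role autoctl_node set idle_session_timeout = '12h';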
Hello.
As I wrote above:
We tried setting the client_connection_check_interval and idle_session_timeout parameters, but that had no effect. Since neither parameter helped, we assume the connection is in the idle state but is still being held open by the client. So we restarted pg_auto_failover on the data node that had created the connection, and the connection on the monitor was closed immediately. In other words, pg_auto_failover on the data node kept holding this connection for some reason and never closed it, even though it kept opening new connections to exchange information with the monitor (once a second, judging by the documentation), and the cluster itself worked correctly.
We tried - it didn't help, apparently pg_auto_failover itself holds the connection.
@xinferum what is your PostgreSQL version?
PostgreSQL version 15.4.
I also want to clarify that this behavior is observed only on one of our projects; it runs several pg_auto_failover clusters, and such connections appear on all of them.
There are no such problems on other projects with a similar cluster configuration, but they run a different operating system.