Check WL_SOCKET_CLOSED before using any connection
Fixes https://github.com/citusdata/citus/issues/5538. This is an alternative to https://github.com/citusdata/citus/pull/5908 that works at the connection management level rather than in the executor.
Postgres added support for WL_SOCKET_CLOSED in PG 15 (https://github.com/postgres/postgres/commit/50e570a59e7f86bb41f029a66b781fc79b8d50f1) on kernels that can report this event.
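To illustrate what "the kernel supports this event" means, here is a minimal sketch of a peer-closed check using plain poll(). It is not the code in this patch or in Postgres; the helper names `socket_closed_by_peer` and `demo_closed_detection` are hypothetical, and the fallback to POLLHUP when POLLRDHUP is unavailable is an assumption made only so the sketch compiles on more platforms.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE				/* exposes POLLRDHUP on Linux/glibc */
#endif
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: fall back to POLLHUP where POLLRDHUP is not defined. */
#ifdef POLLRDHUP
#define PEER_CLOSED_EVENTS (POLLRDHUP | POLLHUP)
#else
#define PEER_CLOSED_EVENTS POLLHUP
#endif

/*
 * Hypothetical helper: returns 1 if the peer has closed the socket (or the
 * descriptor is otherwise unusable), 0 if it still looks open, and -1 if
 * poll() itself failed. POLLIN is deliberately not requested, so buffered
 * data does not affect the answer.
 */
int
socket_closed_by_peer(int fd)
{
	struct pollfd pfd;
	int			rc;

	pfd.fd = fd;
	pfd.events = PEER_CLOSED_EVENTS;
	pfd.revents = 0;

	rc = poll(&pfd, 1, 0);		/* timeout 0: check without blocking */
	if (rc < 0)
		return -1;
	if (rc == 0)
		return 0;				/* no event reported: socket looks open */

	return (pfd.revents & (PEER_CLOSED_EVENTS | POLLERR | POLLNVAL)) ? 1 : 0;
}

/*
 * Hypothetical demo: a local socket pair stands in for a cached connection.
 * Returns 0 if the check reports "open" before the peer closes its end and
 * "closed" afterwards.
 */
int
demo_closed_detection(void)
{
	int			sv[2];
	int			before;
	int			after;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return -1;

	before = socket_closed_by_peer(sv[0]);
	close(sv[1]);				/* simulate the remote node closing the socket */
	after = socket_closed_by_peer(sv[0]);
	close(sv[0]);

	return (before == 0 && after == 1) ? 0 : 1;
}
```

A check like this only sees a close that the kernel has been told about. If the remote side disappears without closing its socket, poll() reports nothing and the connection still looks usable.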
This patch lets us recover cached connections when the remote node closes the socket. After events such as a switchover on the worker nodes, we no longer get connection errors such as the following:
select count(*) from users_table;
-- restart one of the worker nodes
pg_ctl -D /Users/onderkalaci/Documents/data_dir/worker_9701 -o "-p 9701" -l /tmp/logfile_9701 restart -m f
-- before this patch, this query failed with the error below; as of this patch, it DOESN'T fail
select count(*) from users_table;
ERROR: terminating connection due to administrator command
CONTEXT: while executing command on localhost:9700
Time: 7.978 ms
The patch is NOT useful for:
- Recovering from network failures on the worker node that the coordinator cannot detect, e.g., the network is cut off on the worker before it has a chance to close its sockets (or to let other nodes know about it).
- Recovering from failures during the command execution / distributed execution. This infrastructure can be utilized in the executor later on, but that requires non-trivial changes on the executor state machines. Currently, the executor state machines are not designed to re-connect in case of a failure.
Added @thanodnl as a reviewer, since I remember he tried working on this issue in the past.
Also, the isolate_tenant_to_new_shard failure in CI seems to be caused by this PR.
superseded by #6404