Reduce TCP port usage on the QD
The QD establishes a TCP connection to each QE; each connection has a unique
<source address, source port, destination address, destination port> 4-tuple.
To reduce TCP port usage on the QD, this commit sets the SO_REUSEADDR
option on the sockets, so that the connections AllocateGang() opens to QEs on
different hosts can share the same local port.
This was introduced by https://github.com/greenplum-db/gpdb/pull/577, but accidentally removed by https://github.com/greenplum-db/gpdb/pull/3608.
It's harmless for frontends too, so enable it for both.
To fully understand how SO_REUSEADDR works, please review how each connection between QD and QE is established, and this blog: https://gavv.net/articles/ephemeral-port-reuse/.
Do we really need SO_REUSEADDR for non-binding ephemeral ports on the QD (client of the segment) side?
There is a similar PR in the history, #8884, which adds SO_REUSEADDR as a socket option. I think the right place for it is just before the bind() system call there, inside either setupUDPListeningSocket() or setupTCPListeningSocket(), the two listener methods.
To fully understand how SO_REUSEADDR works, please review how each connection between QD and QE is established, and this blog: https://gavv.net/articles/ephemeral-port-reuse/.
In this blog, the author mentions a race condition between the bind() and listen() system calls, which may cause port conflicts. In my experiments on kernel 5.15, the conflict never happened in TCP mode (it's fixed in the latest kernel, as the author mentions at the end of the blog), but it would happen in UDP mode. Anyway, for portability, we can always check for the errno EADDRINUSE returned by the listen() call, and retry if needed, as @paul-guo- suggested here.
Do we really need SO_REUSEADDR for non-binding ephemeral ports on QD (client) side?
I thought it works for outgoing connections too, #577 said so. I will check it.
Thanks. I feel it's maybe better to stay consistent with upstream libpq's general behavior, even if we dispatch to multiple QEs on the same destination host. I think ephemeral ports plus tcp_tw_reuse=1 could be sufficient to support multiple outgoing connections to a single host or to multiple hosts in our case.
After reading some stuff, I think this PR's goal makes sense. Here are my thoughts:
Some background
- The SO_REUSEADDR option deals with the TCP TIME_WAIT state; like tcp_tw_reuse=1, it can make a TCP port be reused quickly.
- The side of a TCP connection that calls close() first goes into the TIME_WAIT state. In network programming best practice, the server side should call close() first.
SO_REUSEADDR has two common scenarios:
- Set it on the server side, when bind()-ing a service IP:Port. This makes the specific port reusable quickly (e.g. when restarting a server program), as @Aegeaner said before.
- Set it on the client side, to mitigate TCP ephemeral port exhaustion. Note, it is only a "mitigation"; the better way is to keep the client from going into the TIME_WAIT state (server side calls close() first, following the best practice).
Unluckily, in gpdb the QD acts as the client side, using libpq to access many QEs (which act as the server side), and the QD closes first, so it suffers from the long TIME_WAIT state. Using SO_REUSEADDR can prevent the TCP ephemeral ports (nearly 60k per host) from being exhausted.
Finally, how do we get there?
One way is to set it in code (this PR); another, as @haolinw said, is to keep the code the same as upstream and set tcp_tw_reuse=1 plus a big ip_local_port_range via sysctl (but the official doc doesn't mention them: https://gpdb.docs.pivotal.io/6-20/install_guide/prep_os.html#sysctl_file).
If customers often encounter this problem, I think setting it in code makes sense, but we need to verify it.
I did a test on a 1-coordinator + 2x2-segment cluster: open 4 psql sessions, each running a single-slice query. The outgoing connections use 16 outgoing ports, with or without this PR.
gpadmin=# select * from gp_segment_configuration;
dbid | content | role | preferred_role | mode | status | port | hostname | address | datadir
------+---------+------+----------------+------+--------+-------+-------------+-------------+------------------------------------------------
1 | -1 | p | p | n | u | 15432 | minion | minion | /home/gpadmin/greenplum-db-data/gpseg-1
2 | 0 | p | p | n | u | 40000 | minion-seg1 | minion-seg1 | /home/gpadmin/greenplum-db-data/primary/gpseg0
4 | 2 | p | p | n | u | 40000 | minion-seg2 | minion-seg2 | /home/gpadmin/greenplum-db-data/primary/gpseg2
3 | 1 | p | p | n | u | 40001 | minion-seg1 | minion-seg1 | /home/gpadmin/greenplum-db-data/primary/gpseg1
5 | 3 | p | p | n | u | 40001 | minion-seg2 | minion-seg2 | /home/gpadmin/greenplum-db-data/primary/gpseg3
gpadmin@minion:~$ netstat -tnp|grep postgres|sort|grep "35:"
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 10.117.190.35:39770 10.117.190.124:40000 ESTABLISHED 5790/postgres: 1543
tcp 0 0 10.117.190.35:39774 10.117.190.124:40000 ESTABLISHED 5802/postgres: 1543
tcp 0 0 10.117.190.35:39776 10.117.190.124:40000 ESTABLISHED 5825/postgres: 1543
tcp 0 0 10.117.190.35:39778 10.117.190.124:40000 ESTABLISHED 5848/postgres: 1543
tcp 0 0 10.117.190.35:43034 10.117.190.47:40001 ESTABLISHED 5790/postgres: 1543
tcp 0 0 10.117.190.35:43038 10.117.190.47:40001 ESTABLISHED 5802/postgres: 1543
tcp 0 0 10.117.190.35:43040 10.117.190.47:40001 ESTABLISHED 5825/postgres: 1543
tcp 0 0 10.117.190.35:43042 10.117.190.47:40001 ESTABLISHED 5848/postgres: 1543
tcp 0 0 10.117.190.35:51794 10.117.190.124:40001 ESTABLISHED 5790/postgres: 1543
tcp 0 0 10.117.190.35:51796 10.117.190.124:40001 ESTABLISHED 5802/postgres: 1543
tcp 0 0 10.117.190.35:51798 10.117.190.124:40001 ESTABLISHED 5825/postgres: 1543
tcp 0 0 10.117.190.35:51800 10.117.190.124:40001 ESTABLISHED 5848/postgres: 1543
tcp 0 0 10.117.190.35:55482 10.117.190.47:40000 ESTABLISHED 5790/postgres: 1543
tcp 0 0 10.117.190.35:55484 10.117.190.47:40000 ESTABLISHED 5802/postgres: 1543
tcp 0 0 10.117.190.35:55486 10.117.190.47:40000 ESTABLISHED 5825/postgres: 1543
tcp 0 0 10.117.190.35:55488 10.117.190.47:40000 ESTABLISHED 5848/postgres: 1543
gpadmin@minion:~$ netstat -tnp|grep postgres|sort|grep "10.117.190.35:"|wc -l
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
16
Is it proved that SO_REUSEADDR makes no difference for outgoing connections? Or did I test it wrongly?
I think we need to consume all outgoing ports (maybe by narrowing down net.ipv4.ip_local_port_range), then trigger more concurrent outgoing connections and check for EADDRINUSE and EADDRNOTAVAIL. The expectation is: without the optimization, we should see EADDRINUSE or EADDRNOTAVAIL returned by connect() in the strace output; with the optimization, no such errors. Otherwise, I think there is no need to apply SO_REUSEADDR to outgoing connections.
Is it proved that SO_REUSEADDR makes no difference for outgoing connections? Or did I test it wrongly?
I guess when the QD calls connect(), the OS allocates a completely unused port for it; it will not reuse an old port even if that old port could be "REUSED".
Try limiting the ports the OS can allocate using ip_local_port_range, and make the OS reuse an old port.
Even more confusing: the ports are reused with or without this PR, after I gave ip_local_port_range a small range.
$ uname -a
Linux minion 5.15.0-41-generic #44-Ubuntu SMP Wed Jun 22 14:20:53 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 40000 40200
$ sysctl net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_reuse = 2 <--- loopback only, doesn't matter
$ netstat -tnp|grep postgres|sort|grep 10.117.190.35|wc -l
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
240
tcp 0 0 10.117.190.35:40190 10.117.190.124:40000 ESTABLISHED 69146/postgres: 154
tcp 0 0 10.117.190.35:40190 10.117.190.47:40000 ESTABLISHED 69156/postgres: 154
tcp 0 0 10.117.190.35:40190 10.117.190.47:40001 ESTABLISHED 69154/postgres: 154
tcp 0 0 10.117.190.35:40192 10.117.190.124:40000 ESTABLISHED 69161/postgres: 154
tcp 0 0 10.117.190.35:40192 10.117.190.47:40000 ESTABLISHED 69155/postgres: 154
tcp 0 0 10.117.190.35:40192 10.117.190.47:40001 ESTABLISHED 69134/postgres: 154
tcp 0 0 10.117.190.35:40194 10.117.190.124:40000 ESTABLISHED 69185/postgres: 154
tcp 0 0 10.117.190.35:40194 10.117.190.47:40000 ESTABLISHED 69144/postgres: 154
tcp 0 0 10.117.190.35:40194 10.117.190.47:40001 ESTABLISHED 69158/postgres: 154
tcp 0 0 10.117.190.35:40196 10.117.190.124:40000 ESTABLISHED 69097/postgres: 154
tcp 0 0 10.117.190.35:40196 10.117.190.47:40000 ESTABLISHED 69139/postgres: 154
tcp 0 0 10.117.190.35:40196 10.117.190.47:40001 ESTABLISHED 69172/postgres: 154
tcp 0 0 10.117.190.35:40198 10.117.190.124:40000 ESTABLISHED 69194/postgres: 154
tcp 0 0 10.117.190.35:40198 10.117.190.47:40000 ESTABLISHED 69188/postgres: 154
tcp 0 0 10.117.190.35:40198 10.117.190.47:40001 ESTABLISHED 69152/postgres: 154
The reuses above are across different (local IP address, local port, foreign IP address, foreign port) 4-tuples.
I did another experiment to check whether Greenplum with this PR will reuse ports whose exact 4-tuple is the same but whose state is TIME_WAIT.
$ cat ~/test.bash
#!/bin/bash
sudo sysctl -w net.ipv4.ip_local_port_range="40000 40128"
# first round
for (( i = 0; i < 60; i++ )); do
psql -c "select * from t1;"
done
sleep 3
# second round
for (( i = 0; i < 60; i++ )); do
psql -c "set gp_vmem_idle_resource_timeout=120000;select * from t1;select pg_sleep(1000);" &
done
The system is out of ports already at the second round:
psql: error: could not connect to server: Cannot assign requested address
Is the server running on host "minion" (10.117.190.33) and accepting
TCP/IP connections on port 15432?
psql: error: could not connect to server: Cannot assign requested address
Is the server running on host "minion" (10.117.190.33) and accepting
TCP/IP connections on port 15432?
psql: error: could not connect to server: Cannot assign requested address
Is the server running on host "minion" (10.117.190.33) and accepting
TCP/IP connections on port 15432?
...
psql: error: FATAL: interconnect error: Could not set up udp listener socket
DETAIL: bind: Address already in use
psql: error: FATAL: interconnect error: Could not set up udp listener socket
DETAIL: bind: Address already in use
psql: error: FATAL: interconnect error: Could not set up udp listener socket
DETAIL: bind: Address already in use
...
ERROR: failed to acquire resources on one or more segments
DETAIL: could not connect to server: Cannot assign requested address
Is the server running on host "10.117.190.47" and accepting
TCP/IP connections on port 40001?
(seg1 10.117.190.47:40001)
But the TIME_WAIT connections are still there, not reused.
$ netstat -tnp|grep -E "TIME_WAIT"|sort|grep 10.117.190.35|wc -l
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
242
"Cannot assign requested address" is EADDRNOTAVAIL.
So that indicates SO_REUSEADDR doesn't take effect for outgoing TCP requests, right? I think we don't need this change, based on the current observations, until we do see a real problem in the future.
Yes, based on the current experiment results, the socket documentation, and the Linux kernel code, SO_REUSEADDR doesn't take effect for outgoing TCP requests.
I will close this in several days if there is no objection.
psql: error: FATAL: interconnect error: Could not set up udp listener socket
I'm a little confused: this error is related to UDP (no available port), not TCP.
The SO_REUSEADDR we discussed before is all about TCP (QD -> segment's postmaster), not the interconnect's UDP connections between QD and QEs.
So I think the SO_REUSEADDR is still helpful to the situation:
QD1 ---> tcp (port1) ---> \
QD2 ---> tcp (port2) ------> seg0 postmaster : 6000
QD3 ---> tcp (port3) ---> /
... # other QDs can quickly reuse some existing `TIME_WAIT` TCP port, right?
psql: error: FATAL: interconnect error: Could not set up udp listener socket
I'm a little confused: this error is related to UDP (no available port), not TCP.
The "could not connect to server: Cannot assign requested address" errors are related to TCP.
My point was that the system had already run out of ports, but the TIME_WAIT connections were still there, not reused.
If we don't consider TIME_WAIT, the ports are always reused, with or without SO_REUSEADDR.
but the TIME_WAIT connections were still there, not reused.
got it, no problem, thanks.
Seems, @adam8157, the conclusion from the PR discussion is that the code change does not help. I am with you that we can revisit based on complaints instead of speculating about the behavior now. I am closing the PR on your behalf; in case something changes, please feel free to reopen it.
My point was that the system had already run out of ports, but the TIME_WAIT connections were still there, not reused.
If we don't consider TIME_WAIT, the ports are always reused, with or without SO_REUSEADDR.
@adam8157 This topic came up again in some other context, and hence I wish to seek clarification on the above point:
- Does setting tcp_tw_reuse help and allow the TIME_WAIT ports to be reused on the QD, or does even that make no difference?
- In a PR comment, Gang mentions he tested the change: with SO_REUSEADDR it worked, and without it he faced the problem in his test. The PR doesn't have the test code or instructions. Curious how Gang had seen this setting help; was it due to some kernel version difference?
https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux This is the best article I have found.
1. Yes, tcp_tw_reuse would help and allow the TIME_WAIT ports to be reused on the QD (the client side).
2. SO_REUSEADDR is not supposed to help anything on the QD (the client side) as I understand it, and I believe the kernel version doesn't matter in this case.
I forgot how I did my experiments, sorry for that.