Reduce TCP port usage on the QD
The QD establishes a TCP connection to each QE; each connection has a unique
<source address, source port, destination address, destination port> 4-tuple.
To reduce TCP port usage on the QD, this commit sets the SO_REUSEADDR
option on the sockets, so that the connections AllocateGang() opens to QEs on
different hosts can share the same local port.
This was introduced by https://github.com/greenplum-db/gpdb/pull/577, but accidentally removed by https://github.com/greenplum-db/gpdb/pull/3608.
It's harmless for frontends too, so enable it for both.
To fully understand how SO_REUSEADDR works, please review how each connection between QD and QE is established, and this blog: https://gavv.net/articles/ephemeral-port-reuse/.
Do we really need SO_REUSEADDR for non-binding ephemeral ports on the QD (client of the segment) side?
There is a similar PR in the history, #8884, which adds SO_REUSEADDR as a socket option. I think the right place for it is just before the bind() system call there, inside either setupUDPListeningSocket() or setupTCPListeningSocket(), the two listener methods.
To fully understand how SO_REUSEADDR works, please review how each connection between QD and QE is established, and this blog: https://gavv.net/articles/ephemeral-port-reuse/.
In this blog, the author mentions a race condition between the bind() and listen() system calls, which may cause port conflicts. In my experiments on kernel 5.15, the conflict never happened in TCP mode (it's fixed in the latest kernel, as the author mentions at the end of the blog), but it would happen in UDP mode. Anyway, for portability, we can always check for the errno EADDRINUSE returned by the listen() call, and retry if needed, as @paul-guo- suggested here.
Do we really need SO_REUSEADDR for non-binding ephemeral ports on QD (client) side?
I thought it works for outgoing connections too, #577 said so. I will check it.
Thanks. I feel it's maybe better to stay consistent with upstream libpq's general behavior, even if we dispatch to multiple QEs on the same destination host. I think ephemeral ports plus tcp_tw_reuse=1 could be sufficient to support multiple outgoing connections to a single host or to multiple hosts in our case.
After reading some stuff, I think this PR's goal makes sense. Here are my thoughts:
Some background
- The SO_REUSEADDR option deals with the TCP TIME_WAIT state; like tcp_tw_reuse=1, it can make a TCP port be reused quickly.
- The side of a TCP connection that calls close() first goes into the TIME_WAIT state. In network programming best practice, the server side should call close() first.
SO_REUSEADDR has two common scenarios:
- Set it on the server side, when bind()-ing a service IP:Port. This makes the specific port reusable quickly (e.g. when restarting a server program), as @Aegeaner said before.
- Set it on the client side, to mitigate TCP ephemeral port exhaustion. Note, it is only a "mitigation"; the better way is to keep the client from going into the TIME_WAIT state (server side calls close() first, following the best practice).
Unluckily, in gpdb the QD acts as the client side, using libpq to access many QEs (which act as the server side), and the QD closes first, so it suffers from the long TIME_WAIT state. Using SO_REUSEADDR can prevent the TCP ephemeral ports (nearly 60k per host) from being exhausted.
Finally, how do we get there?
One way is to set it in code (this PR); another, as @haolinw said, is to keep the code the same as upstream and set tcp_tw_reuse=1 plus a big ip_local_port_range via sysctl (but the official doc doesn't mention them: https://gpdb.docs.pivotal.io/6-20/install_guide/prep_os.html#sysctl_file).
If customers often encounter this problem, I think setting it in code makes sense, but we need to verify it.
I did a test on a 1-coordinator + 2x2-segment cluster: open 4 psql sessions, each running a single-slice query. The outgoing connections use 16 outgoing ports, with or without this PR.
gpadmin=# select * from gp_segment_configuration;
dbid | content | role | preferred_role | mode | status | port | hostname | address | datadir
------+---------+------+----------------+------+--------+-------+-------------+-------------+------------------------------------------------
1 | -1 | p | p | n | u | 15432 | minion | minion | /home/gpadmin/greenplum-db-data/gpseg-1
2 | 0 | p | p | n | u | 40000 | minion-seg1 | minion-seg1 | /home/gpadmin/greenplum-db-data/primary/gpseg0
4 | 2 | p | p | n | u | 40000 | minion-seg2 | minion-seg2 | /home/gpadmin/greenplum-db-data/primary/gpseg2
3 | 1 | p | p | n | u | 40001 | minion-seg1 | minion-seg1 | /home/gpadmin/greenplum-db-data/primary/gpseg1
5 | 3 | p | p | n | u | 40001 | minion-seg2 | minion-seg2 | /home/gpadmin/greenplum-db-data/primary/gpseg3
gpadmin@minion:~$ netstat -tnp|grep postgres|sort|grep "35:"
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 10.117.190.35:39770 10.117.190.124:40000 ESTABLISHED 5790/postgres: 1543
tcp 0 0 10.117.190.35:39774 10.117.190.124:40000 ESTABLISHED 5802/postgres: 1543
tcp 0 0 10.117.190.35:39776 10.117.190.124:40000 ESTABLISHED 5825/postgres: 1543
tcp 0 0 10.117.190.35:39778 10.117.190.124:40000 ESTABLISHED 5848/postgres: 1543
tcp 0 0 10.117.190.35:43034 10.117.190.47:40001 ESTABLISHED 5790/postgres: 1543
tcp 0 0 10.117.190.35:43038 10.117.190.47:40001 ESTABLISHED 5802/postgres: 1543
tcp 0 0 10.117.190.35:43040 10.117.190.47:40001 ESTABLISHED 5825/postgres: 1543
tcp 0 0 10.117.190.35:43042 10.117.190.47:40001 ESTABLISHED 5848/postgres: 1543
tcp 0 0 10.117.190.35:51794 10.117.190.124:40001 ESTABLISHED 5790/postgres: 1543
tcp 0 0 10.117.190.35:51796 10.117.190.124:40001 ESTABLISHED 5802/postgres: 1543
tcp 0 0 10.117.190.35:51798 10.117.190.124:40001 ESTABLISHED 5825/postgres: 1543
tcp 0 0 10.117.190.35:51800 10.117.190.124:40001 ESTABLISHED 5848/postgres: 1543
tcp 0 0 10.117.190.35:55482 10.117.190.47:40000 ESTABLISHED 5790/postgres: 1543
tcp 0 0 10.117.190.35:55484 10.117.190.47:40000 ESTABLISHED 5802/postgres: 1543
tcp 0 0 10.117.190.35:55486 10.117.190.47:40000 ESTABLISHED 5825/postgres: 1543
tcp 0 0 10.117.190.35:55488 10.117.190.47:40000 ESTABLISHED 5848/postgres: 1543
gpadmin@minion:~$ netstat -tnp|grep postgres|sort|grep "10.117.190.35:"|wc -l
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
16
Is it proved that SO_REUSEADDR makes no difference for outgoing connections? Or did I test it wrongly?
I think we need to consume all outgoing ports (maybe by narrowing down net.ipv4.ip_local_port_range), then trigger more concurrent outgoing connections and check for EADDRINUSE and EADDRNOTAVAIL. The expectation is: without the optimization, we should see EADDRINUSE or EADDRNOTAVAIL returned by connect() in the strace output; with the optimization, no such errors. Otherwise, I think there is no need to apply SO_REUSEADDR to outgoing connections.
Is it proved that SO_REUSEADDR makes no difference for outgoing connections? Or did I test it wrongly?
I guess when the QD calls connect(), the OS allocates a completely unused port for it; it will not reuse an old port even if that old port could be "REUSED".
Try limiting the ports the OS can allocate using ip_local_port_range, and make the OS reuse an old port.
Even more confusing: the ports are reused with or without this PR, after I gave ip_local_port_range a small range.
$ uname -a
Linux minion 5.15.0-41-generic #44-Ubuntu SMP Wed Jun 22 14:20:53 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 40000 40200
$ sysctl net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_reuse = 2 <--- loopback only, doesn't matter
$ netstat -tnp|grep postgres|sort|grep 10.117.190.35|wc -l
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
240
tcp 0 0 10.117.190.35:40190 10.117.190.124:40000 ESTABLISHED 69146/postgres: 154
tcp 0 0 10.117.190.35:40190 10.117.190.47:40000 ESTABLISHED 69156/postgres: 154
tcp 0 0 10.117.190.35:40190 10.117.190.47:40001 ESTABLISHED 69154/postgres: 154
tcp 0 0 10.117.190.35:40192 10.117.190.124:40000 ESTABLISHED 69161/postgres: 154
tcp 0 0 10.117.190.35:40192 10.117.190.47:40000 ESTABLISHED 69155/postgres: 154
tcp 0 0 10.117.190.35:40192 10.117.190.47:40001 ESTABLISHED 69134/postgres: 154
tcp 0 0 10.117.190.35:40194 10.117.190.124:40000 ESTABLISHED 69185/postgres: 154
tcp 0 0 10.117.190.35:40194 10.117.190.47:40000 ESTABLISHED 69144/postgres: 154
tcp 0 0 10.117.190.35:40194 10.117.190.47:40001 ESTABLISHED 69158/postgres: 154
tcp 0 0 10.117.190.35:40196 10.117.190.124:40000 ESTABLISHED 69097/postgres: 154
tcp 0 0 10.117.190.35:40196 10.117.190.47:40000 ESTABLISHED 69139/postgres: 154
tcp 0 0 10.117.190.35:40196 10.117.190.47:40001 ESTABLISHED 69172/postgres: 154
tcp 0 0 10.117.190.35:40198 10.117.190.124:40000 ESTABLISHED 69194/postgres: 154
tcp 0 0 10.117.190.35:40198 10.117.190.47:40000 ESTABLISHED 69188/postgres: 154
tcp 0 0 10.117.190.35:40198 10.117.190.47:40001 ESTABLISHED 69152/postgres: 154
The reuses above are across different (local IP address, local port, foreign IP address, foreign port) 4-tuples.
I did another experiment to check whether Greenplum with this PR will reuse ports whose exact 4-tuple is the same but whose state is TIME_WAIT.
$ cat ~/test.bash
#!/bin/bash
sudo sysctl -w net.ipv4.ip_local_port_range="40000 40128"
# first round
for (( i = 0; i < 60; i++ )); do
psql -c "select * from t1;"
done
sleep 3
# second round
for (( i = 0; i < 60; i++ )); do
psql -c "set gp_vmem_idle_resource_timeout=120000;select * from t1;select pg_sleep(1000);" &
done
The system is out of ports already at the second round:
psql: error: could not connect to server: Cannot assign requested address
Is the server running on host "minion" (10.117.190.33) and accepting
TCP/IP connections on port 15432?
psql: error: could not connect to server: Cannot assign requested address
Is the server running on host "minion" (10.117.190.33) and accepting
TCP/IP connections on port 15432?
psql: error: could not connect to server: Cannot assign requested address
Is the server running on host "minion" (10.117.190.33) and accepting
TCP/IP connections on port 15432?
...
psql: error: FATAL: interconnect error: Could not set up udp listener socket
DETAIL: bind: Address already in use
psql: error: FATAL: interconnect error: Could not set up udp listener socket
DETAIL: bind: Address already in use
psql: error: FATAL: interconnect error: Could not set up udp listener socket
DETAIL: bind: Address already in use
...
ERROR: failed to acquire resources on one or more segments
DETAIL: could not connect to server: Cannot assign requested address
Is the server running on host "10.117.190.47" and accepting
TCP/IP connections on port 40001?
(seg1 10.117.190.47:40001)
But the TIME_WAIT connections are still there, not reused.
$ netstat -tnp|grep -E "TIME_WAIT"|sort|grep 10.117.190.35|wc -l
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
242
"Cannot assign requested address" is EADDRNOTAVAIL.
So that indicates SO_REUSEADDR doesn't take effect for outgoing TCP requests, right? I think we don't need this change, based on the current observations, until we do see a real problem in the future.
Yes, based on the current experiment results, the socket documentation, and the Linux kernel code, SO_REUSEADDR doesn't take effect for outgoing TCP requests.
I will close this in several days if there is no objection.
psql: error: FATAL: interconnect error: Could not set up udp listener socket
I'm a little confused: this error is related to UDP (no available port), not TCP.
The SO_REUSEADDR we discussed before is all about TCP (QD -> segment's postmaster), not the interconnect's UDP connections between QD and QEs.
So I think the SO_REUSEADDR is still helpful to the situation:
QD1 ---> tcp (port1) ---> \
QD2 ---> tcp (port2) ------> seg0 postmaster : 6000
QD3 ---> tcp (port3) ---> /
... # other QDs can quickly reuse some existing `TIME_WAIT` TCP port, right?
psql: error: FATAL: interconnect error: Could not set up udp listener socket
I'm a little confused: this error is related to UDP (no available port), not TCP.
The "could not connect to server: Cannot assign requested address" errors are related to TCP.
My point was that the system had already run out of ports, but the TIME_WAIT connections were still there, not reused.
If we don't consider TIME_WAIT, the ports are always reused, with or without SO_REUSEADDR.
but the TIME_WAIT connections were still there, not reused.
got it, no problem, thanks.
Seems, @adam8157, the conclusion from the PR discussion is that the code change does not help. I am with you that we can revisit based on complaints instead of speculating about the behavior now. I am closing the PR on your behalf; in case something changes, please feel free to reopen it.
My point was that the system had already run out of ports, but the TIME_WAIT connections were still there, not reused.
If we don't consider TIME_WAIT, the ports are always reused, with or without SO_REUSEADDR.
@adam8157 This topic came up again in some other context, and hence I wish to seek clarification on the above point:
- Does setting tcp_tw_reuse help and allow the TIME_WAIT ports to be reused on the QD, or does even that make no difference?
- In a PR comment, Gang mentions he tested the change: with SO_REUSEADDR it worked, and without it he faced the problem in his test. The PR doesn't have the test code or instructions. Curious how Gang had seen this setting help; was it due to some kernel version difference?
https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux This is the best article I have found.
1. Yes, tcp_tw_reuse would help and allow the TIME_WAIT ports to be reused on the QD (the client side).
2. SO_REUSEADDR is not supposed to help anything on the QD (the client side) as I understand it, and I believe the kernel version doesn't matter in this case.
I forgot how I did my experiments, sorry for that.