walsender connects to the wrong walreceiver

Open my-ship-it opened this issue 2 years ago • 0 comments

Greenplum version or build

Tested it on master

Setup

Multi-host cluster with 4 segment hosts and a spare host. Each segment host has 2 primaries and 2 mirrors.

Hosts: mdw, sdw1, sdw2, sdw3, sdw4
sdw1 - 2 primaries and 2 mirrors (the primaries for these mirrors are on sdw4)
sdw4 - 2 primaries (the mirrors for these primaries are on sdw1) and 2 mirrors

Spare host sdw5

 dbid | content | role | preferred_role | mode | status | port  | hostname | address
------+---------+------+----------------+------+--------+-------+----------+---------
    1 |      -1 | p    | p              | n    | u      |  5432 | mdw      | mdw
   17 |       7 | m    | m              | s    | u      | 21001 | sdw1     | sdw1
    2 |       0 | p    | p              | s    | u      | 20000 | sdw1     | sdw1
   16 |       6 | m    | m              | s    | u      | 21000 | sdw1     | sdw1
    3 |       1 | p    | p              | s    | u      | 20001 | sdw1     | sdw1
    5 |       3 | p    | p              | s    | u      | 20001 | sdw2     | sdw2
    4 |       2 | p    | p              | s    | u      | 20000 | sdw2     | sdw2
   10 |       0 | m    | m              | s    | u      | 21000 | sdw2     | sdw2
   11 |       1 | m    | m              | s    | u      | 21001 | sdw2     | sdw2
   12 |       2 | m    | m              | s    | u      | 21000 | sdw3     | sdw3
   13 |       3 | m    | m              | s    | u      | 21001 | sdw3     | sdw3
    6 |       4 | p    | p              | s    | u      | 20000 | sdw3     | sdw3
    7 |       5 | p    | p              | s    | u      | 20001 | sdw3     | sdw3
    9 |       7 | p    | p              | s    | u      | 20001 | sdw4     | sdw4
   15 |       5 | m    | m              | s    | u      | 21001 | sdw4     | sdw4
   14 |       4 | m    | m              | s    | u      | 21000 | sdw4     | sdw4
    8 |       6 | p    | p              | s    | u      | 20000 | sdw4     | sdw4

autoconf options used ( config.status --config )

Installation information ( pg_config )

Expected behavior

Actual behavior

Step to reproduce the behavior

Make sdw1 unreachable by adding a "blackhole" route on every other host. Note that all postgres processes are still alive on sdw1.

ssh mdw sudo ip route add blackhole <ip_of_sdw1>
ssh sdw2 sudo ip route add blackhole <ip_of_sdw1>
ssh sdw3 sudo ip route add blackhole <ip_of_sdw1>
ssh sdw4 sudo ip route add blackhole <ip_of_sdw1>
ssh sdw5 sudo ip route add blackhole <ip_of_sdw1>




    
  psql postgres -c 'select gp_request_fts_probe_scan()'

  Wait for all segments on sdw1 to be marked as 'down'

  gprecoverseg -p sdw5

  Wait for all segments on sdw5 to be marked as 'up'

  gprecoverseg -ar

  Make sdw5 unreachable by adding a blackhole entry just like sdw1

  psql postgres -c 'select gp_request_fts_probe_scan()'

  Wait for all segments on sdw5 to be marked as 'down'

  Reconnect sdw1: make it reachable again by deleting the blackhole route. Note that we never killed any postgres processes on sdw1.


ssh mdw sudo ip route delete <ip_of_sdw1>
ssh sdw2 sudo ip route delete <ip_of_sdw1>
ssh sdw3 sudo ip route delete <ip_of_sdw1>
ssh sdw4 sudo ip route delete <ip_of_sdw1>
ssh sdw5 sudo ip route delete <ip_of_sdw1>
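
The repeated per-host commands for cutting off and reconnecting a host could be generated with a small helper. This is only a sketch: `route_cmds` is a hypothetical name, the IP address is a placeholder, and the function just prints the commands (matching the ones used in this report) so they can be reviewed before being piped to a shell.

```shell
# Hypothetical helper: print (do not run) the ssh commands that add or delete
# a blackhole route for TARGET_IP on each listed host.
route_cmds() {
  action="$1"
  target_ip="$2"
  shift 2
  case "$action" in
    add)    route="ip route add blackhole $target_ip" ;;
    delete) route="ip route delete $target_ip" ;;
    *)      echo "usage: route_cmds add|delete <ip> <host>..." >&2; return 1 ;;
  esac
  for host in "$@"; do
    printf 'ssh %s sudo %s\n' "$host" "$route"
  done
}

# Example (placeholder IP): review the output, then pipe it to sh to execute.
# route_cmds add 192.0.2.11 mdw sdw2 sdw3 sdw4 sdw5
# route_cmds delete 192.0.2.11 mdw sdw2 sdw3 sdw4 sdw5 | sh
```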

In gp_segment_configuration, the mirrors that belong to sdw5 will be marked as up, which is incorrect since sdw5 is still unreachable.
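
The "wait for all segments on a host to be marked as 'down'" steps above can be scripted by polling gp_segment_configuration. A minimal sketch follows, assuming the catalog reports down segments with status `d` and that psql is run in unaligned, tuples-only mode (`-At`) so rows arrive as `hostname|status`. The function name `all_segments_down` and the 5-second interval are illustrative choices, not part of the original report.

```shell
# Hypothetical helper: reads "hostname|status" rows on stdin and succeeds only
# if the given host appears at least once and every one of its rows has status 'd'.
all_segments_down() {
  awk -F'|' -v host="$1" '
    $1 == host { seen = 1; if ($2 != "d") bad = 1 }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Example polling loop (requires a running cluster, so it is left commented out):
# until psql postgres -At -c "select hostname, status from gp_segment_configuration" \
#         | all_segments_down sdw1; do
#   sleep 5
# done
```

Treating a host that never appears in the input as "not down" is a deliberate choice, so that a failed or empty query does not end the wait early.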

RCA

Clean cluster: The walsenders on sdw4 are connected to the walreceivers on sdw1. Let's focus on one of the primaries on sdw4, the one with port 20000. The corresponding mirror for this primary is on sdw1 with port 21000.

ssh sdw4 "netstat -avnp | grep '20000'"
tcp        0      0 <ip_of_sdw4>:20000      <ip_of_sdw1>:53072      ESTABLISHED 15042/postgres: 200

gpadmin@mdw:/data/gpdata/core$ ssh sdw1 "netstat -avnp | grep '53072'"
tcp        0      0 <ip_of_sdw1>:53072      <ip_of_sdw4>:20000      ESTABLISHED 11524/postgres: 210
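
To repeat this check across several primaries, the remote end of each established connection on a primary's port can be pulled out of the netstat output. A rough sketch, assuming the column layout shown above (local address in column 4, foreign address in column 5, state in column 6); `established_peers` is a hypothetical name.

```shell
# Hypothetical helper: reads `netstat -avnp` lines on stdin and prints the
# foreign address of every ESTABLISHED connection whose local port is PORT.
established_peers() {
  awk -v port="$1" '
    $6 == "ESTABLISHED" && $4 ~ (":" port "$") { print $5 }
  '
}

# Example (run against the primary's host):
# ssh sdw4 "netstat -avnp" | established_peers 20000
```

On a clean cluster the printed address should belong to the host of the primary's mirror; an empty result means no walreceiver is connected on that port.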

After sdw1 is made unreachable and all the segments on sdw1 have been marked as down: The walsender on sdw4 port 20000 is no longer connected to sdw1; only the listening sockets remain.

ssh sdw4 "netstat -avnp | grep '20000'"
tcp        0      0 0.0.0.0:20000      0.0.0.0:*      LISTEN 14296/postgres
tcp6       0      0 :::20000      :::*      LISTEN 14296/postgres

But the walreceiver still exists on sdw1

gpadmin@sdw1:~$ ps uxww | grep 21000 | grep walr
gpadmin 13168 0.0 0.0 527372 6396 ? Ss 00:22 0:00 postgres: 21000, walreceiver

This walreceiver on sdw1 keeps trying to connect to sdw4, and its socket is stuck in the SYN_SENT state. Even though the server (sdw4) doesn't respond with a SYN-ACK, the walreceiver keeps retrying by resending SYN packets. This is the root cause of the walsender on sdw4 getting connected to the wrong walreceiver.

gpadmin@sdw1:~$ netstat -avnp | grep postgres
tcp        0      1 <ip_of_sdw1>:53186      <ip_of_sdw4>:20000      SYN_SENT 13731/postgres: 210

After disconnecting sdw5 and reconnecting sdw1: Now the walreceiver on sdw1 can establish a connection with the walsender on sdw4. As a result, in gp_segment_configuration the mirrors that belong to sdw5 are marked as up, which is incorrect since sdw5 is still unreachable. This happens because the walsender on sdw4 is now connected to sdw1's walreceiver, so sdw4's primaries notify FTS that their mirrors are up, although their true mirrors, which are on sdw5, are still not accessible.
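
The mismatch described here can be detected by comparing the mirror address registered in gp_segment_configuration with the host actually connected to the primary's walsender. The sketch below is illustrative only: `mirror_address` and `peer_matches_mirror` are hypothetical names, the input is assumed to be `content|role|address` rows (for example from `psql -At`), and both values are assumed to use the same naming form. Since `netstat -n` prints IP addresses, a real check would first have to resolve hostnames.

```shell
# Hypothetical helper: reads "content|role|address" rows on stdin and prints
# the address of the mirror registered for the given content id.
mirror_address() {
  awk -F'|' -v content="$1" '$1 == content && $2 == "m" { print $3 }'
}

# Hypothetical helper: succeeds only if the walsender's peer ("host:port")
# is on the host registered as the mirror.
peer_matches_mirror() {
  expected="$1"
  peer_host="${2%:*}"   # strip the trailing ":port"
  [ "$expected" = "$peer_host" ]
}

# Example (requires a running cluster, so it is left commented out):
# expected="$(psql postgres -At -c \
#   "select content, role, address from gp_segment_configuration" | mirror_address 6)"
# peer_matches_mirror "$expected" "<peer_host>:<peer_port>" || echo "walsender peer is not the registered mirror"
```

In the scenario from this report, the registered mirror would be on sdw5 while the connected peer is on sdw1, so the check would fail.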

my-ship-it, Jul 24 '23 07:07