Simple insert query takes tens of seconds after restart on an interesting WAL
Found in #1989.
How to reproduce
- Check out commit c25948008aa6795ae66d6727238333fa36a2caae
- Enable postgres logs by adding pg.config(['log_min_messages=DEBUG2']) right before pg.start() in test_wal_generate.py
- Build everything with make -j4
- Run a single test:
RUST_LOG=debug poetry run pytest -k 'test_wal_generate[last_wal_record_crossing_segment]'
Note how a simple INSERT INTO keys SELECT generate_series(101, 200) in the middle of the test takes 10-15 seconds; see test logs.
Details
The test does the following:
- Generates a small amount of data.
2022-07-04 14:00:51.130 INFO [neon_fixtures.py:358] Executing query: CREATE TABLE keys(key int primary key)
2022-07-04 14:00:52.77 INFO [neon_fixtures.py:358] Executing query: INSERT INTO keys SELECT generate_series(1, 100)
2022-07-04 14:00:52.100 INFO [neon_fixtures.py:358] Executing query: SELECT SUM(key) FROM keys
- Generates an interesting WAL by sending special requests to the compute. Specifically, the last WAL record crosses the segment boundary, and there are no WAL records after that.
2022-07-04 14:00:52.142 INFO [neon_fixtures.py:939] Running command "/home/yeputons/neon/neon/target/debug/wal_generate in-existing last_wal_record_crossing_segment host=localhost port=15007 user=cloud_admin dbname=postgres options=-cstatement_timeout=120s"
2022-07-04 14:00:52.142 INFO [neon_fixtures.py:940] Running in "/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo"
2022-07-04 14:00:54.324 INFO [neon_fixtures.py:964] Run success: end_of_wal = 0/2700278
- Stops the pageserver and the safekeeper, then starts them again.
2022-07-04 14:00:54.325 INFO [test_wal_generate.py:33] Restarting all safekeepers and pageservers
...
2022-07-04 14:00:58.660 INFO [neon_fixtures.py:964] Run success: Starting pageserver at 'localhost:15003' in '/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo'
Pageserver started
- Does a cold SELECT query, which takes some time for some reason (but that's out of scope for this issue):
2022-07-04 14:00:58.675 INFO [neon_fixtures.py:358] Executing query: SELECT SUM(key) FROM keys
- And the next INSERT now takes 10-15 seconds:
2022-07-04 14:01:00.276 INFO [neon_fixtures.py:358] Executing query: INSERT INTO keys SELECT generate_series(101, 200)
2022-07-04 14:01:14.988 INFO [neon_fixtures.py:358] Executing query: SELECT SUM(key) FROM keys
Logs
This reproduces every time. It seems that the whole system is (dead?)locked.
The compute (pg.log) thinks there is some backpressure and does not write WAL (sending message len 0) until, for some reason, it eventually does (sending message len 184):
2022-07-04 11:01:00.289 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:00.289 GMT [25251] DEBUG: backpressure throttling: lag 25166456
2022-07-04 11:01:00.299 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:00.299 GMT [25251] DEBUG: backpressure throttling: lag 25166456
...
2022-07-04 11:01:04.971 GMT [25144] DEBUG: sending message len 0 beginLsn=0/2700278 endLsn=0/2700278 commitLsn=0/2700278 truncateLsn=0/2700278 to localhost:15005
2022-07-04 11:01:04.971 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: current_timeline_size 0
2022-07-04 11:01:04.971 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_writelsn 0/0
2022-07-04 11:01:04.971 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_flushlsn 0/0
2022-07-04 11:01:04.971 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_applylsn 0/0
2022-07-04 11:01:04.971 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_replytime 710247664971681 reply_time: 2022-07-04 11:01:04.971681+00
2022-07-04 11:01:04.971 GMT [25144] DEBUG: received message term=2 flushLsn=0/2700278 commitLsn=0/2700278 from localhost:15005
2022-07-04 11:01:04.971 GMT [25144] DEBUG: GetLatestZentihFeedback: currentClusterSize 0, ps_writelsn 0/0, ps_flushlsn 0/0, ps_applylsn 0/0, ps_replytime 710247664971681
2022-07-04 11:01:04.976 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:04.976 GMT [25251] DEBUG: backpressure throttling: lag 25166456
...
2022-07-04 11:01:14.324 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.324 GMT [25251] DEBUG: backpressure throttling: lag 25166456
2022-07-04 11:01:14.334 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.334 GMT [25251] DEBUG: backpressure throttling: lag 25166456
2022-07-04 11:01:14.335 GMT [25144] DEBUG: sending message len 184 beginLsn=0/2700278 endLsn=0/2700330 commitLsn=0/2700278 truncateLsn=0/2700278 to localhost:15005
2022-07-04 11:01:14.335 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: current_timeline_size 0
2022-07-04 11:01:14.335 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_writelsn 0/0
2022-07-04 11:01:14.335 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_flushlsn 0/0
2022-07-04 11:01:14.335 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_applylsn 0/0
2022-07-04 11:01:14.335 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_replytime 710247674335835 reply_time: 2022-07-04 11:01:14.335835+00
2022-07-04 11:01:14.335 GMT [25144] DEBUG: received message term=2 flushLsn=0/2700330 commitLsn=0/2700278 from localhost:15005
2022-07-04 11:01:14.335 GMT [25144] DEBUG: GetLatestZentihFeedback: currentClusterSize 0, ps_writelsn 0/0, ps_flushlsn 0/0, ps_applylsn 0/0, ps_replytime 710247674335835
2022-07-04 11:01:14.336 GMT [25144] DEBUG: write 0/2700330 flush 0/2700330 apply 0/0 reply_time 2022-07-04 11:01:14.336002+00
2022-07-04 11:01:14.336 GMT [25144] DEBUG: sending message len 0 beginLsn=0/2700330 endLsn=0/2700330 commitLsn=0/2700330 truncateLsn=0/2700330 to localhost:15005
2022-07-04 11:01:14.336 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: current_timeline_size 0
2022-07-04 11:01:14.336 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_writelsn 0/0
2022-07-04 11:01:14.336 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_flushlsn 0/0
2022-07-04 11:01:14.336 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_applylsn 0/0
2022-07-04 11:01:14.336 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_replytime 710247674336219 reply_time: 2022-07-04 11:01:14.336219+00
2022-07-04 11:01:14.336 GMT [25144] DEBUG: received message term=2 flushLsn=0/2700330 commitLsn=0/2700278 from localhost:15005
2022-07-04 11:01:14.336 GMT [25144] DEBUG: GetLatestZentihFeedback: currentClusterSize 0, ps_writelsn 0/0, ps_flushlsn 0/0, ps_applylsn 0/0, ps_replytime 710247674336219
2022-07-04 11:01:14.344 GMT [25251] DEBUG: current flushLsn 0/2700330 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.344 GMT [25251] DEBUG: backpressure throttling: lag 25166640
2022-07-04 11:01:14.354 GMT [25251] DEBUG: current flushLsn 0/2700330 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.354 GMT [25251] DEBUG: backpressure throttling: lag 25166640
The safekeeper does nothing until it receives the WAL to write from the compute:
2022-07-04T11:00:58.971766Z INFO {tid=15}: got query "START_WAL_PUSH postgresql://no_user:@localhost:15003"
2022-07-04T11:00:58.971929Z INFO {tid=15}:{timeline=9a545ef7f96d4e3ba3e60823c5e36e0f}: loading control file ./9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab/safekeeper.control
2022-07-04T11:00:58.972623Z INFO {tid=15}:{timeline=9a545ef7f96d4e3ba3e60823c5e36e0f}: timeline 33760f91928b64a7c62dd431507b92ab restored
2022-07-04T11:00:58.972822Z INFO {tid=15}:{timeline=9a545ef7f96d4e3ba3e60823c5e36e0f}: find_end_of_wal_segment reached zeros at 0/2700278, last records ends at 0/2700278
2022-07-04T11:00:58.972906Z INFO {tid=15}:{timeline=9a545ef7f96d4e3ba3e60823c5e36e0f}: initialized storage for timeline 33760f91928b64a7c62dd431507b92ab, flush_lsn=0/2700278, commit_lsn=0/2700278, peer_horizon_lsn=0/2700278
2022-07-04T11:00:58.973400Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: start handshake with wal proposer 127.0.0.1:52096 sysid 7116470666150203840 timeline 0
2022-07-04T11:00:58.973724Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: processed greeting from proposer [167, 171, 234, 47, 198, 226, 102, 131, 108, 41, 59, 60, 125, 1, 27, 251], sending term 2
2022-07-04T11:00:58.973852Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: timeline 9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab active=true now
2022-07-04T11:00:58.974083Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: processed VoteRequest for term 2: VoteResponse { term: 2, vote_given: 0, flush_lsn: 0/2700278, truncate_lsn: 0/2700278, term_history: [(2, 0/1696628)], timeline_start_lsn: 0/1696628 }
2022-07-04T11:00:58.974460Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: received ProposerElected ProposerElected { term: 2, start_streaming_at: 0/2700278, term_history: [(2, 0/1696628)], timeline_start_lsn: 0/1696628 }
2022-07-04T11:00:58.977909Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: start receiving WAL since 0/2700278
...
2022-07-04T11:01:14.335673Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: restart decoder from 0/0 to 0/2700278
2022-07-04T11:01:14.513067Z INFO {tid=17}: got query "IDENTIFY_SYSTEM"
2022-07-04T11:01:14.516828Z INFO {tid=17}: got query "START_REPLICATION PHYSICAL 0/2700278"
2022-07-04T11:01:14.517410Z INFO {tid=17}:WAL sender{timeline=33760f91928b64a7c62dd431507b92ab}: Start replication from 0/2700278 till None
The pageserver does almost nothing as well:
2022-07-04T11:00:58.652475Z INFO Starting an HTTP endpoint at 127.0.0.1:15004
2022-07-04T11:00:58.656905Z DEBUG parsed 2 headers
2022-07-04T11:00:58.656948Z DEBUG incoming body is empty
2022-07-04T11:00:58.657837Z INFO GET /v1/status 200 OK
2022-07-04T11:00:58.658002Z DEBUG flushed 115 bytes
2022-07-04T11:00:58.659274Z DEBUG read eof
2022-07-04T11:00:58.663641Z DEBUG accepted connection from 127.0.0.1:45588
2022-07-04T11:00:58.663946Z DEBUG Starting thread 'serving Page Service thread'
2022-07-04T11:00:58.664371Z DEBUG SSL requested
2022-07-04T11:00:58.664893Z DEBUG process query "pagestream 9a545ef7f96d4e3ba3e60823c5e36e0f 33760f91928b64a7c62dd431507b92ab"
2022-07-04T11:00:58.678392Z INFO pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: directory "/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo/tenants/9a545ef7f96d4e3ba3e60823c5e36e0f/wal-redo-datadir" exists, removing
2022-07-04T11:00:58.696829Z INFO pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: running initdb in "/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo/tenants/9a545ef7f96d4e3ba3e60823c5e36e0f/wal-redo-datadir"
2022-07-04T11:00:59.416032Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:00:59.416576Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:00.253320Z INFO pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: launched WAL redo postgres process on "/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo/tenants/9a545ef7f96d4e3ba3e60823c5e36e0f/wal-redo-datadir"
2022-07-04T11:01:00.263363Z DEBUG pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: postgres applied 2 WAL records in 1584842 us to reconstruct page image at LSN 0/2700278
2022-07-04T11:01:00.272049Z DEBUG pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/16387 blkno=0 req_lsn=0/16F4190}: postgres applied 1 WAL records in 239 us to reconstruct page image at LSN 0/2700278
2022-07-04T11:01:00.428871Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:00.429486Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:01.434604Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:01.434964Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:02.440449Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:02.440940Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:03.446906Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:03.447395Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:04.453535Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:04.453860Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
...
2022-07-04T11:01:13.503325Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:13.503695Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:14.508646Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:14.508943Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:14.509500Z INFO wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(1), wal_producer_connstr: "host=127.0.0.1 port=15005 options='-c ztimelineid=33760f91928b64a7c62dd431507b92ab ztenantid=9a545ef7f96d4e3ba3e60823c5e36e0f'", reason: NoExistingConnection }
2022-07-04T11:01:14.509807Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: connecting to host=127.0.0.1 port=15005 options='-c ztimelineid=33760f91928b64a7c62dd431507b92ab ztenantid=9a545ef7f96d4e3ba3e60823c5e36e0f'
2022-07-04T11:01:14.512631Z DEBUG walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: executing simple query: IDENTIFY_SYSTEM
2022-07-04T11:01:14.512812Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:safekeeper_handle_db: connected!
2022-07-04T11:01:14.513800Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: IdentifySystem { systemid: 7116470666150203840, timeline: 1, xlogpos: 0/2700330, dbname: None }
2022-07-04T11:01:14.514242Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: last_record_lsn 0/2700278 starting replication from 0/2700278, server is at 0/2700330...
2022-07-04T11:01:14.514317Z DEBUG walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: executing copy both query START_REPLICATION PHYSICAL 0/2700278
2022-07-04T11:01:14.579771Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: caught up at LSN 0/2700330
2022-07-04T11:01:14.580170Z DEBUG walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: zenith_status_update ReplicationFeedback { current_timeline_size: 25395200, ps_writelsn: 40895280, ps_applylsn: 0, ps_flushlsn: 40895096, ps_replytime: SystemTime { tv_sec: 1656932474, tv_nsec: 580141600 } }
It all unlocks when the compute decides to send a non-empty message to the safekeeper at 2022-07-04 11:01:14.335 GMT; the safekeeper calls write_wal, which resets the WalDecoder (which then starts reporting an LSN greater than zero) at 2022-07-04T11:01:14.335673Z; and then the pageserver decides to reconnect to the safekeeper at 2022-07-04T11:01:14.509500Z because apparently there was NoExistingConnection (why didn't it reconnect in the last ten seconds, then?)
Thanks for the detailed description. It really helps when investigating the issue.
Just taking a look into this. It seems that this issue happens when we restart all safekeepers and pageservers after running the WAL generation. If the "restart" part is commented out, the large request time doesn't happen.
Based on the PG log:
2022-07-04 11:01:14.336 GMT [25144] DEBUG: GetLatestZentihFeedback: currentClusterSize 0, ps_writelsn 0/0, ps_flushlsn 0/0, ps_applylsn 0/0, ps_replytime 710247674336219
2022-07-04 11:01:14.344 GMT [25251] DEBUG: current flushLsn 0/2700330 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.344 GMT [25251] DEBUG: backpressure throttling: lag 25166640
It looks like upon restarting the pageserver and compute nodes, the ZenithFeedback data stored inside walprop_shared->feedback (WalproposerShmemState) hasn't been initialized, which is the reason for the big LSN lag.
Digging a bit deeper, I found that we only send a ZenithFeedback from the pageserver in
https://github.com/neondatabase/neon/blob/main/pageserver/src/walreceiver/walreceiver_connection.rs#L245-L251
This requires establishing a WAL receiver connection. However, on pageserver restart, etcd doesn't immediately start a new WAL connection. Instead, it spends quite a while processing some update events before starting a new connection candidate.
2022-07-04T11:01:00.253320Z INFO pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: launched WAL redo postgres process on "/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo/tenants/9a545ef7f96d4e3ba3e60823c5e36e0f/wal-redo-datadir"
2022-07-04T11:01:00.263363Z DEBUG pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: postgres applied 2 WAL records in 1584842 us to reconstruct page image at LSN 0/2700278
2022-07-04T11:01:00.272049Z DEBUG pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/16387 blkno=0 req_lsn=0/16F4190}: postgres applied 1 WAL records in 239 us to reconstruct page image at LSN 0/2700278
2022-07-04T11:01:00.428871Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:00.429486Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:01.434604Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:01.434964Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:02.440449Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:02.440940Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:03.446906Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:03.447395Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:04.453535Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:04.453860Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
...
2022-07-04T11:01:13.503325Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:13.503695Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:14.508646Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:14.508943Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:14.509500Z INFO wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(1), wal_producer_connstr: "host=127.0.0.1 port=15005 options='-c ztimelineid=33760f91928b64a7c62dd431507b92ab ztenantid=9a545ef7f96d4e3ba3e60823c5e36e0f'", reason: NoExistingConnection }
After switching to a new connection, the pageserver processes the WAL and sends a replication feedback message, which notifies the compute node about the actual lag:
2022-07-04T11:01:14.514242Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: last_record_lsn 0/2700278 starting replication from 0/2700278, server is at 0/2700330...
2022-07-04T11:01:14.514317Z DEBUG walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: executing copy both query START_REPLICATION PHYSICAL 0/2700278
2022-07-04T11:01:14.579771Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: caught up at LSN 0/2700330
2022-07-04T11:01:14.580170Z DEBUG walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: zenith_status_update ReplicationFeedback { current_timeline_size: 25395200, ps_writelsn: 40895280, ps_applylsn: 0, ps_flushlsn: 40895096, ps_replytime: SystemTime { tv_sec: 1656932474, tv_nsec: 580141600 } }
I still don't know the underlying reason for the above behaviour. Maybe someone who is more familiar with etcd code can take a look.
cc @SomeoneToIgnore WDYT?
My wild guess is that some timeout happens that forces etcd to start a new connection, which is why the insert's request time is about 15 seconds every time.
Most probably, those "Processing 1 events" messages are the other ones sent by safekeepers, related to their WAL offload. One could check the events with
etcdctl --endpoints=etcd-release.local:2379 watch --prefix neon/9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab
or a similar prefix.
There are no explicit timeouts in the related pageserver code, and none are intended.
When you see the wal_connection_manager span logs mentioning subscriptions, it means that the pageserver has successfully subscribed to a certain key in etcd and sees something similar to the watch command output above.
Once that happens for a certain timeline, the pageserver never changes its behaviour, sticking to the etcd subscription until it's cancelled from the safekeepers' side or something global happens, such as a panic, a timeline detach, or a pageserver shutdown.
AFAIK, the current safekeeper never explicitly cancels the subscriptions it pushes (see its broker.rs), yet their TTL is ~10 seconds and they should be updated every second with newer values after the SK is restarted, so I don't see how that alone could cause errors.
Another possibility for the "Processing 1 events" to be very much related but ignored is when all safekeepers report an Lsn that's below what the pageserver has: that way, they are never considered by the pageserver as connection candidates.
Last but not least, there is an exponential backoff that we add when connecting to a given etcd subscription or safekeeper in a loop, but judging from the logs that's not the case here, so I'm not sure it's related.
I would start by checking etcd messages with etcdctl to see if etcd is actually working as expected, since there were some concerns about its performance in our workloads.
when all safekeepers report an Lsn that's below what the pageserver has: that way, they are never considered by the pageserver as connection candidates
I think that's the reason. Upon restarting,
the ZenithFeedback data stored inside walprop_shared->feedback (WalproposerShmemState) hasn't been initialized, which is the reason for the big LSN lag,
and the big lag prevents the compute node from sending any WAL.
The pageserver ignores any new connection candidate because its commit LSN is no greater than the pageserver's last record LSN:
self.wal_stream_candidates
    .iter()
    .filter(|(_, etcd_info)| {
        etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())
    })
    ...
No new connection means no ReplicationFeedback is sent to the compute node to initialize the correct write lag. No replication feedback to the compute node means no new WAL (unchanged commit LSN). These things create a dependency cycle/a deadlock, but:
Why does this work normally after 15s?
Probably, the compute node has some timeout or background job that sends WAL regardless of the lag value. Notice that in the pg log, after 15 seconds of inactivity, the compute sent new WAL of len=184 despite the current large write lag:
2022-07-04 11:01:14.324 GMT [25251] DEBUG: backpressure throttling: lag 25166456
2022-07-04 11:01:14.334 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.334 GMT [25251] DEBUG: backpressure throttling: lag 25166456
2022-07-04 11:01:14.335 GMT [25144] DEBUG: sending message len 184 beginLsn=0/2700278 endLsn=0/2700330 commitLsn=0/2700278 truncateLsn=0/2700278 to localhost:15005
...
The SK receives the WAL and updates the commit LSN -> the pageserver creates a new connection and sends a ReplicationFeedback -> the compute node updates the write lag values.
One way to fix this is to modify WalreceiverState::applicable_connection_candidates to accept a new candidate with etcd_info.timeline.commit_lsn >= Some(self.local_timeline.get_last_record_lsn()) if the pageserver has no connection, and otherwise keep using etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn()).
@SomeoneToIgnore WDYT?
That's a great finding, thank you for looking into that. TBH, this solution feels like a stub to me, fixing the issue in a non-obvious way, but I have no better options myself, so it might be OK to implement it this way, though more comments and unit tests should be added. @LizardWizzard could propose something better, I believe 🙂
Why does this work normally after 15s? Probably, the compute node has some timeout or background job that sends WAL regardless of the lag value.
I dug "a little bit deeper" into this and found that the background job seems to be the background writer's LogStandbySnapshot:
if (XLogStandbyInfoActive() && !RecoveryInProgress())
{
    TimestampTz timeout = 0;
    TimestampTz now = GetCurrentTimestamp();

    // LOG_SNAPSHOT_INTERVAL_MS = 15s
    timeout = TimestampTzPlusMilliseconds(last_snapshot_ts, LOG_SNAPSHOT_INTERVAL_MS);

    /*
     * Only log if enough time has passed and interesting records have
     * been inserted since the last snapshot. Have to compare with <=
     * instead of < because GetLastImportantRecPtr() points at the
     * start of a record, whereas last_snapshot_lsn points just past
     * the end of the record.
     */
    if (now >= timeout &&
        last_snapshot_lsn <= GetLastImportantRecPtr())
    {
        last_snapshot_lsn = LogStandbySnapshot();
        last_snapshot_ts = now;
    }
}
And the behaviour described in this issue can be easily replicated with the below test:
def test_restart_delay(neon_simple_env: NeonEnv):
    env = neon_simple_env
    env.neon_cli.create_branch('test')
    pg = env.postgres.create('test')
    # pg.config(['log_min_messages=DEBUG2'])
    pg.start()

    # generate enough WAL to exceed the maximum write lag
    pg.safe_psql_many(queries=[
        'CREATE TABLE foo(key int primary key)',
        'INSERT INTO foo SELECT generate_series(1, 100000)',
    ])

    log.info("sleep 15s to trigger the background writer")
    time.sleep(15)

    env.pageserver.stop()
    env.safekeepers[0].stop()
    env.safekeepers[0].start()
    env.pageserver.start()

    timer = timeit.default_timer()
    pg.safe_psql('INSERT INTO foo SELECT generate_series(100001, 100100)')
    duration = timeit.default_timer() - timer

    # inserting 100 rows should not take too much time
    assert duration < 2.0
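Assuming the test is placed next to test_wal_generate.py (and has that module's usual imports for time, timeit, log, and NeonEnv), it can be run the same way as the other tests above, e.g. poetry run pytest -k 'test_restart_delay'.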
Hmm. Two additional questions bother me:
- Is it correct that there is a part of the compute that sends WAL to Safekeepers ignoring backpressure?
- Why does this startup issue only happen with XLOG_SWITCH? I suspect it has something to do with the lack of a partial segment.
- Is it correct that there is a part of the compute that sends WAL to Safekeepers ignoring backpressure?
Yes. The WAL sender isn't affected by backpressure: it sends a new WAL record if there is one. The part affected by backpressure is probably the WAL-generating code. I'm not sure about this; it's just my observation from adding logs to the WalSndLoop function. Can someone confirm this behaviour? cc @knizhnik, @hlinnaka.
- Why does this startup issue only happen with XLOG_SWITCH? I suspect it has something to do with the lack of a partial segment.
I'm not too familiar with XLOG_SWITCH, so I don't really understand the question. This startup issue happens consistently if we restart the pageserver right before the background writer runs; the time for the first write request is then about 15s (= the period of the background writer).
This startup issue happens consistently
Hmm, if I try RUST_LOG=debug poetry run pytest -k 'test_wal_generate[simple]', I do not see such behavior, do you? The difference is the shape of WAL generated.
This startup issue happens consistently
Hmm, if I try RUST_LOG=debug poetry run pytest -k 'test_wal_generate[simple]', I do not see such behavior, do you? The difference is the shape of WAL generated.
Oh I was referring to the new test I added above.
The difference is the shape of WAL generated.
One requirement for the startup issue is that a specific amount of WAL (> 15MB) needs to be accumulated before the restart to trigger the backpressure on startup. I guess that's not the case for the "[simple]" test.
Is it correct that there is a part of the compute that sends WAL to Safekeepers ignoring backpressure?
Backpressure blocks write transactions. The walsender is not blocked. Autovacuum can set hint bits outside a transaction; in this case, with wal_log_hints=on, autovacuum can produce a lot of WAL and is not throttled by backpressure. wal_log_hints is set to true by default, but in prod/staging it is set to false (I wonder if we should also change the default value).
@aome510 thanks for the investigation.
One way to fix this is to modify WalreceiverState::applicable_connection_candidates
That would leave the connection open even if there is nothing to do (currently it still works mostly the same, but this is not an excuse). The better way would be for safekeepers to announce the existence of live compute(s) along with commit_lsn, and for the pageserver to connect if there is a compute, regardless of its caught-up-ness.
To recap, the problem is that while there is no new WAL for the pageserver, it doesn't connect to safekeepers -- and this deadlocks with backpressure kicking in while the pageserver's position is unknown. LogStandbySnapshot is not limited by backpressure and resolves it after some time, as it is logged every 15 seconds. Also, if the pageserver is not caught up after start, the problem disappears as well, since it connects anyway, obviously.
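A rough sketch of that idea, with hypothetical names (SkTimelineInfo and num_computes here are assumptions, not the actual broker schema): the safekeeper publishes how many compute connections it has, and the pageserver treats any safekeeper with a live compute as a connection candidate, even when it is fully caught up.

// Hypothetical sketch only; field and type names are assumptions.
#[derive(Clone, Copy, PartialEq, PartialOrd)]
struct Lsn(u64);

struct SkTimelineInfo {
    commit_lsn: Option<Lsn>,
    // Proposed addition: number of compute (walproposer) connections the
    // safekeeper currently has for this timeline.
    num_computes: u32,
}

fn is_connection_candidate(info: &SkTimelineInfo, last_record_lsn: Lsn) -> bool {
    // Strictly newer WAL is always interesting, as today...
    info.commit_lsn > Some(last_record_lsn)
        // ...and with a live compute attached, connect regardless of
        // caught-up-ness, so feedback flows and backpressure clears.
        || info.num_computes > 0
}

fn main() {
    let caught_up = SkTimelineInfo { commit_lsn: Some(Lsn(0x2700278)), num_computes: 1 };
    // A caught-up safekeeper is still a candidate as long as a compute is attached.
    assert!(is_connection_candidate(&caught_up, Lsn(0x2700278)));
}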
I've created a separate branch with @aome510's test here: https://github.com/neondatabase/neon/commits/test-restart-delay (I hope you don't mind that I've put you as the commit author).
@arssher has suggested that the time.sleep(15) may be necessary not because we want to trigger the background writer before restarting, but because we want to sync the pageserver with the compute so that it sees no reason to connect to the safekeeper (and, consequently, update the backpressure information) after the restart.
For example, if I time.sleep(5), the test passes erroneously, but if I time.sleep(9), it breaks correctly on my machine.
So I altered the test to wait for the pageserver instead of sleeping 15 seconds. It fails consistently now.
One way to fix this is to modify WalreceiverState::applicable_connection_candidates
That would leave the connection open even if there is nothing to do (currently it still works mostly the same, but this is not an excuse). The better way would be for safekeepers to announce the existence of live compute(s) along with commit_lsn, and for the pageserver to connect if there is a compute, regardless of its caught-up-ness.
Is there anyone actively working on implementing this? I'm not too familiar with the safekeeper code, but I can try if needed. I also found that we already have a reliable test for this issue in https://github.com/neondatabase/neon/commits/test-restart-delay.
cc @yeputons @arssher
Is there anyone actively working on implementing this? I'm not too familiar with the safekeeper code, but I can try if needed.
@aome510 Yes, please. In case of questions, feel free to reach out.
As far as I understand, it is the same problem which I have fixed in https://github.com/neondatabase/postgres/pull/187
In short: the backpressure lag is incorrectly calculated because of mixing InvalidXLogRecPtr=0 and UnknownXLogRecPtr=~0.
Unfortunately, there is still no comment or review on this PR.
Maybe it is better to eliminate UnknownXLogRecPtr altogether; it was introduced to make the calculation of the minimum LSN among replicas/safekeepers more straightforward.
As far as I understand, it is the same problem which I have fixed in https://github.com/neondatabase/postgres/pull/187
It looks like all three of your variants allow writes (disable backpressure) while the position of the pageserver is unknown. This is a bit dangerous, as it makes it possible to ignore backpressure early during startup, though the window is small and perhaps this is OK. What I proposed instead is to establish the safekeeper <-> pageserver connection ASAP after a connection from the compute appears, by publishing knowledge about compute connections on safekeepers. Even if we don't need this for backpressure, there is one more reason I wanted to add it anyway: if sync-safekeepers waits only for the catchup of a majority of safekeepers (and it should), then the pageserver must be able to immediately switch to one of the safekeepers in this majority to provide a basebackup faster. Knowing the number of compute connections allows doing this without adding the risk of flip-flopping during normal work (if there are compute connections, the pageserver shouldn't immediately prefer a more advanced safekeeper).
@yeputons Could you please try again on the current main? After the https://github.com/neondatabase/postgres/pull/187 merge, it should be gone.
Gone indeed. poetry run pytest -k 'test_crafted_wal_end' runs each of the five tests in about 5 seconds now.
Also, edf4b3ee3cf6350b70f797a906c25733866a1791 (the test from https://github.com/neondatabase/neon/issues/2023#issuecomment-1179351820) works.