Simple insert query takes tens of seconds after restart on an interesting WAL
Found in #1989.
How to reproduce
- Check out commit c25948008aa6795ae66d6727238333fa36a2caae
- Enable postgres logs by adding pg.config(['log_min_messages=DEBUG2']) right before pg.start() in test_wal_generate.py
- Build everything with make -j4
- Run a single test:
RUST_LOG=debug poetry run pytest -k 'test_wal_generate[last_wal_record_crossing_segment]'
Note how a simple INSERT INTO keys SELECT generate_series(101, 200) in the middle of the test takes 10-15 seconds; see test logs.
Details
The test does the following:
- Generates a small amount of data.
2022-07-04 14:00:51.130 INFO [neon_fixtures.py:358] Executing query: CREATE TABLE keys(key int primary key)
2022-07-04 14:00:52.77 INFO [neon_fixtures.py:358] Executing query: INSERT INTO keys SELECT generate_series(1, 100)
2022-07-04 14:00:52.100 INFO [neon_fixtures.py:358] Executing query: SELECT SUM(key) FROM keys
- Generates an interesting WAL by sending special requests to the compute. Specifically, the last WAL record crosses the segment boundary, and there are no WAL records after that.
2022-07-04 14:00:52.142 INFO [neon_fixtures.py:939] Running command "/home/yeputons/neon/neon/target/debug/wal_generate in-existing last_wal_record_crossing_segment host=localhost port=15007 user=cloud_admin dbname=postgres options=-cstatement_timeout=120s"
2022-07-04 14:00:52.142 INFO [neon_fixtures.py:940] Running in "/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo"
2022-07-04 14:00:54.324 INFO [neon_fixtures.py:964] Run success: end_of_wal = 0/2700278
- Stops the pageserver and the safekeeper, then starts them again.
2022-07-04 14:00:54.325 INFO [test_wal_generate.py:33] Restarting all safekeepers and pageservers
...
2022-07-04 14:00:58.660 INFO [neon_fixtures.py:964] Run success: Starting pageserver at 'localhost:15003' in '/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo'
Pageserver started
- Does a cold SELECT query, which takes some time for some reason (but that's out of scope for this issue):
2022-07-04 14:00:58.675 INFO [neon_fixtures.py:358] Executing query: SELECT SUM(key) FROM keys
- And the next INSERT now takes 10-15 seconds:
2022-07-04 14:01:00.276 INFO [neon_fixtures.py:358] Executing query: INSERT INTO keys SELECT generate_series(101, 200)
2022-07-04 14:01:14.988 INFO [neon_fixtures.py:358] Executing query: SELECT SUM(key) FROM keys
Logs
This reproduces every time. It seems that the whole system is (dead?)locked.
The compute (pg.log) thinks there is some backpressure and does not write WAL (sending message len 0) until, for some reason, it eventually does (sending message len 184):
2022-07-04 11:01:00.289 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:00.289 GMT [25251] DEBUG: backpressure throttling: lag 25166456
2022-07-04 11:01:00.299 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:00.299 GMT [25251] DEBUG: backpressure throttling: lag 25166456
...
2022-07-04 11:01:04.971 GMT [25144] DEBUG: sending message len 0 beginLsn=0/2700278 endLsn=0/2700278 commitLsn=0/2700278 truncateLsn=0/2700278 to localhost:15005
2022-07-04 11:01:04.971 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: current_timeline_size 0
2022-07-04 11:01:04.971 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_writelsn 0/0
2022-07-04 11:01:04.971 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_flushlsn 0/0
2022-07-04 11:01:04.971 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_applylsn 0/0
2022-07-04 11:01:04.971 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_replytime 710247664971681 reply_time: 2022-07-04 11:01:04.971681+00
2022-07-04 11:01:04.971 GMT [25144] DEBUG: received message term=2 flushLsn=0/2700278 commitLsn=0/2700278 from localhost:15005
2022-07-04 11:01:04.971 GMT [25144] DEBUG: GetLatestZentihFeedback: currentClusterSize 0, ps_writelsn 0/0, ps_flushlsn 0/0, ps_applylsn 0/0, ps_replytime 710247664971681
2022-07-04 11:01:04.976 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:04.976 GMT [25251] DEBUG: backpressure throttling: lag 25166456
...
2022-07-04 11:01:14.324 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.324 GMT [25251] DEBUG: backpressure throttling: lag 25166456
2022-07-04 11:01:14.334 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.334 GMT [25251] DEBUG: backpressure throttling: lag 25166456
2022-07-04 11:01:14.335 GMT [25144] DEBUG: sending message len 184 beginLsn=0/2700278 endLsn=0/2700330 commitLsn=0/2700278 truncateLsn=0/2700278 to localhost:15005
2022-07-04 11:01:14.335 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: current_timeline_size 0
2022-07-04 11:01:14.335 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_writelsn 0/0
2022-07-04 11:01:14.335 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_flushlsn 0/0
2022-07-04 11:01:14.335 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_applylsn 0/0
2022-07-04 11:01:14.335 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_replytime 710247674335835 reply_time: 2022-07-04 11:01:14.335835+00
2022-07-04 11:01:14.335 GMT [25144] DEBUG: received message term=2 flushLsn=0/2700330 commitLsn=0/2700278 from localhost:15005
2022-07-04 11:01:14.335 GMT [25144] DEBUG: GetLatestZentihFeedback: currentClusterSize 0, ps_writelsn 0/0, ps_flushlsn 0/0, ps_applylsn 0/0, ps_replytime 710247674335835
2022-07-04 11:01:14.336 GMT [25144] DEBUG: write 0/2700330 flush 0/2700330 apply 0/0 reply_time 2022-07-04 11:01:14.336002+00
2022-07-04 11:01:14.336 GMT [25144] DEBUG: sending message len 0 beginLsn=0/2700330 endLsn=0/2700330 commitLsn=0/2700330 truncateLsn=0/2700330 to localhost:15005
2022-07-04 11:01:14.336 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: current_timeline_size 0
2022-07-04 11:01:14.336 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_writelsn 0/0
2022-07-04 11:01:14.336 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_flushlsn 0/0
2022-07-04 11:01:14.336 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_applylsn 0/0
2022-07-04 11:01:14.336 GMT [25144] DEBUG: ParseReplicationFeedbackMessage: ps_replytime 710247674336219 reply_time: 2022-07-04 11:01:14.336219+00
2022-07-04 11:01:14.336 GMT [25144] DEBUG: received message term=2 flushLsn=0/2700330 commitLsn=0/2700278 from localhost:15005
2022-07-04 11:01:14.336 GMT [25144] DEBUG: GetLatestZentihFeedback: currentClusterSize 0, ps_writelsn 0/0, ps_flushlsn 0/0, ps_applylsn 0/0, ps_replytime 710247674336219
2022-07-04 11:01:14.344 GMT [25251] DEBUG: current flushLsn 0/2700330 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.344 GMT [25251] DEBUG: backpressure throttling: lag 25166640
2022-07-04 11:01:14.354 GMT [25251] DEBUG: current flushLsn 0/2700330 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.354 GMT [25251] DEBUG: backpressure throttling: lag 25166640
The safekeeper does nothing until it receives the WAL to write from the compute:
2022-07-04T11:00:58.971766Z INFO {tid=15}: got query "START_WAL_PUSH postgresql://no_user:@localhost:15003"
2022-07-04T11:00:58.971929Z INFO {tid=15}:{timeline=9a545ef7f96d4e3ba3e60823c5e36e0f}: loading control file ./9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab/safekeeper.control
2022-07-04T11:00:58.972623Z INFO {tid=15}:{timeline=9a545ef7f96d4e3ba3e60823c5e36e0f}: timeline 33760f91928b64a7c62dd431507b92ab restored
2022-07-04T11:00:58.972822Z INFO {tid=15}:{timeline=9a545ef7f96d4e3ba3e60823c5e36e0f}: find_end_of_wal_segment reached zeros at 0/2700278, last records ends at 0/2700278
2022-07-04T11:00:58.972906Z INFO {tid=15}:{timeline=9a545ef7f96d4e3ba3e60823c5e36e0f}: initialized storage for timeline 33760f91928b64a7c62dd431507b92ab, flush_lsn=0/2700278, commit_lsn=0/2700278, peer_horizon_lsn=0/2700278
2022-07-04T11:00:58.973400Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: start handshake with wal proposer 127.0.0.1:52096 sysid 7116470666150203840 timeline 0
2022-07-04T11:00:58.973724Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: processed greeting from proposer [167, 171, 234, 47, 198, 226, 102, 131, 108, 41, 59, 60, 125, 1, 27, 251], sending term 2
2022-07-04T11:00:58.973852Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: timeline 9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab active=true now
2022-07-04T11:00:58.974083Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: processed VoteRequest for term 2: VoteResponse { term: 2, vote_given: 0, flush_lsn: 0/2700278, truncate_lsn: 0/2700278, term_history: [(2, 0/1696628)], timeline_start_lsn: 0/1696628 }
2022-07-04T11:00:58.974460Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: received ProposerElected ProposerElected { term: 2, start_streaming_at: 0/2700278, term_history: [(2, 0/1696628)], timeline_start_lsn: 0/1696628 }
2022-07-04T11:00:58.977909Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: start receiving WAL since 0/2700278
...
2022-07-04T11:01:14.335673Z INFO {tid=15}:WAL acceptor{timeline=33760f91928b64a7c62dd431507b92ab}: restart decoder from 0/0 to 0/2700278
2022-07-04T11:01:14.513067Z INFO {tid=17}: got query "IDENTIFY_SYSTEM"
2022-07-04T11:01:14.516828Z INFO {tid=17}: got query "START_REPLICATION PHYSICAL 0/2700278"
2022-07-04T11:01:14.517410Z INFO {tid=17}:WAL sender{timeline=33760f91928b64a7c62dd431507b92ab}: Start replication from 0/2700278 till None
The pageserver does almost nothing as well:
2022-07-04T11:00:58.652475Z INFO Starting an HTTP endpoint at 127.0.0.1:15004
2022-07-04T11:00:58.656905Z DEBUG parsed 2 headers
2022-07-04T11:00:58.656948Z DEBUG incoming body is empty
2022-07-04T11:00:58.657837Z INFO GET /v1/status 200 OK
2022-07-04T11:00:58.658002Z DEBUG flushed 115 bytes
2022-07-04T11:00:58.659274Z DEBUG read eof
2022-07-04T11:00:58.663641Z DEBUG accepted connection from 127.0.0.1:45588
2022-07-04T11:00:58.663946Z DEBUG Starting thread 'serving Page Service thread'
2022-07-04T11:00:58.664371Z DEBUG SSL requested
2022-07-04T11:00:58.664893Z DEBUG process query "pagestream 9a545ef7f96d4e3ba3e60823c5e36e0f 33760f91928b64a7c62dd431507b92ab"
2022-07-04T11:00:58.678392Z INFO pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: directory "/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo/tenants/9a545ef7f96d4e3ba3e60823c5e36e0f/wal-redo-datadir" exists, removing
2022-07-04T11:00:58.696829Z INFO pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: running initdb in "/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo/tenants/9a545ef7f96d4e3ba3e60823c5e36e0f/wal-redo-datadir"
2022-07-04T11:00:59.416032Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:00:59.416576Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:00.253320Z INFO pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: launched WAL redo postgres process on "/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo/tenants/9a545ef7f96d4e3ba3e60823c5e36e0f/wal-redo-datadir"
2022-07-04T11:01:00.263363Z DEBUG pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: postgres applied 2 WAL records in 1584842 us to reconstruct page image at LSN 0/2700278
2022-07-04T11:01:00.272049Z DEBUG pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/16387 blkno=0 req_lsn=0/16F4190}: postgres applied 1 WAL records in 239 us to reconstruct page image at LSN 0/2700278
2022-07-04T11:01:00.428871Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:00.429486Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:01.434604Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:01.434964Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:02.440449Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:02.440940Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:03.446906Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:03.447395Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:04.453535Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:04.453860Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
...
2022-07-04T11:01:13.503325Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:13.503695Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:14.508646Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:14.508943Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:14.509500Z INFO wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(1), wal_producer_connstr: "host=127.0.0.1 port=15005 options='-c ztimelineid=33760f91928b64a7c62dd431507b92ab ztenantid=9a545ef7f96d4e3ba3e60823c5e36e0f'", reason: NoExistingConnection }
2022-07-04T11:01:14.509807Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: connecting to host=127.0.0.1 port=15005 options='-c ztimelineid=33760f91928b64a7c62dd431507b92ab ztenantid=9a545ef7f96d4e3ba3e60823c5e36e0f'
2022-07-04T11:01:14.512631Z DEBUG walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: executing simple query: IDENTIFY_SYSTEM
2022-07-04T11:01:14.512812Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:safekeeper_handle_db: connected!
2022-07-04T11:01:14.513800Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: IdentifySystem { systemid: 7116470666150203840, timeline: 1, xlogpos: 0/2700330, dbname: None }
2022-07-04T11:01:14.514242Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: last_record_lsn 0/2700278 starting replication from 0/2700278, server is at 0/2700330...
2022-07-04T11:01:14.514317Z DEBUG walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: executing copy both query START_REPLICATION PHYSICAL 0/2700278
2022-07-04T11:01:14.579771Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: caught up at LSN 0/2700330
2022-07-04T11:01:14.580170Z DEBUG walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: zenith_status_update ReplicationFeedback { current_timeline_size: 25395200, ps_writelsn: 40895280, ps_applylsn: 0, ps_flushlsn: 40895096, ps_replytime: SystemTime { tv_sec: 1656932474, tv_nsec: 580141600 } }
It all unlocks when the compute decides to send a non-empty message to the safekeeper at 2022-07-04 11:01:14.335 GMT; the safekeeper calls write_wal, which resets the WalDecoder (which then starts reporting an LSN greater than zero) at 2022-07-04T11:01:14.335673Z; and then the pageserver decides to reconnect to the safekeeper at 2022-07-04T11:01:14.509500Z because apparently there was NoExistingConnection (why didn't it reconnect in the last ten seconds, then?)
Thanks for the detailed description. It really helps when investigating the issue.
Just taking a look into this. It seems that this issue happens when we restart all safekeepers and pageservers after running the WAL generation. If the "restart" part is commented out, the large request time doesn't happen.
Based on the PG log:
2022-07-04 11:01:14.336 GMT [25144] DEBUG: GetLatestZentihFeedback: currentClusterSize 0, ps_writelsn 0/0, ps_flushlsn 0/0, ps_applylsn 0/0, ps_replytime 710247674336219
2022-07-04 11:01:14.344 GMT [25251] DEBUG: current flushLsn 0/2700330 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.344 GMT [25251] DEBUG: backpressure throttling: lag 25166640
It looks like upon restarting the pageserver and compute nodes, the ZenithFeedback data stored inside walprop_shared->feedback (WalproposerShmemState) hasn't been initialized, which is the reason for the big LSN lag.
Digging a bit deeper, I found that we only send a ZenithFeedback from the pageserver in
https://github.com/neondatabase/neon/blob/main/pageserver/src/walreceiver/walreceiver_connection.rs#L245-L251
This requires establishing a WAL receiver connection. However, on pageserver restart, etcd doesn't immediately start a new WAL connection. Instead, it spends quite a while processing some update events before starting a new connection candidate.
2022-07-04T11:01:00.253320Z INFO pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: launched WAL redo postgres process on "/home/yeputons/neon/neon/test_output/test_wal_generate[last_wal_record_crossing_segment]/repo/tenants/9a545ef7f96d4e3ba3e60823c5e36e0f/wal-redo-datadir"
2022-07-04T11:01:00.263363Z DEBUG pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/2663 blkno=1 req_lsn=0/16F4190}: postgres applied 2 WAL records in 1584842 us to reconstruct page image at LSN 0/2700278
2022-07-04T11:01:00.272049Z DEBUG pagestream{timeline=33760f91928b64a7c62dd431507b92ab tenant=9a545ef7f96d4e3ba3e60823c5e36e0f}:get_page{rel=1663/12974/16387 blkno=0 req_lsn=0/16F4190}: postgres applied 1 WAL records in 239 us to reconstruct page image at LSN 0/2700278
2022-07-04T11:01:00.428871Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:00.429486Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:01.434604Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:01.434964Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:02.440449Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:02.440940Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:03.446906Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:03.447395Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:04.453535Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:04.453860Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
...
2022-07-04T11:01:13.503325Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:13.503695Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:14.508646Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:Connection{peer=Client}: received frame=Data { stream_id: StreamId(1) }
2022-07-04T11:01:14.508943Z DEBUG wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}:etcd_subscription:etcd_broker: Processing 1 events
2022-07-04T11:01:14.509500Z INFO wal_connection_manager{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(1), wal_producer_connstr: "host=127.0.0.1 port=15005 options='-c ztimelineid=33760f91928b64a7c62dd431507b92ab ztenantid=9a545ef7f96d4e3ba3e60823c5e36e0f'", reason: NoExistingConnection }
After switching to a new connection, the pageserver processes the WAL and sends a replication feedback message, which notifies the compute node about the actual lag:
2022-07-04T11:01:14.514242Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: last_record_lsn 0/2700278 starting replication from 0/2700278, server is at 0/2700330...
2022-07-04T11:01:14.514317Z DEBUG walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: executing copy both query START_REPLICATION PHYSICAL 0/2700278
2022-07-04T11:01:14.579771Z INFO walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: caught up at LSN 0/2700330
2022-07-04T11:01:14.580170Z DEBUG walreceiver_connection{id=9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab}: zenith_status_update ReplicationFeedback { current_timeline_size: 25395200, ps_writelsn: 40895280, ps_applylsn: 0, ps_flushlsn: 40895096, ps_replytime: SystemTime { tv_sec: 1656932474, tv_nsec: 580141600 } }
I still don't know the underlying reason for the above behaviour. Maybe someone who is more familiar with etcd code can take a look.
cc @SomeoneToIgnore WDYT?
My wild guess is that some timeout happens that forces etcd to start a new connection, which is why the insert's request time is about 15 seconds every time.
Most probably, those "Processing 1 events" messages are the other ones sent by safekeepers, related to their WAL offload. One could check the events with
etcdctl --endpoints=etcd-release.local:2379 watch --prefix neon/9a545ef7f96d4e3ba3e60823c5e36e0f/33760f91928b64a7c62dd431507b92ab
or a similar prefix.
There are no explicit timeouts in the related pageserver code, and none are intended.
When you see the wal_connection_manager span logs mentioning subscriptions, it means that the pageserver has successfully subscribed to a certain key in etcd and sees something similar to the watch command output above.
Once that happens for a certain timeline, the pageserver never changes its behaviour, sticking to the etcd subscription until it's cancelled from the safekeepers' side or something global happens, such as a panic, a timeline detach, or a pageserver shutdown.
AFAIK, the current safekeeper never explicitly cancels the subscriptions it pushes (see its broker.rs), yet their TTL is ~10 seconds and they should be updated every second with newer values after the SK is restarted, so I don't see how that alone could cause errors.
Another possibility for the "Processing 1 events" to be very much related but ignored is when all safekeepers report an Lsn that's below what the pageserver has: that way, they are never considered by the pageserver as connection candidates.
Last but not least, there is an exponential backoff that we add when connecting to a given etcd subscription or safekeeper in a loop, but judging from the logs that's not the case here, so I'm not sure it's related.
I would start by checking etcd messages with etcdctl to see if etcd is actually working as expected, since there were some concerns about its performance in our workloads.
when all safekeepers report an Lsn that's below what the pageserver has: that way, they are never considered by the pageserver as connection candidates
I think that's the reason. Upon restarting,
the ZenithFeedback data stored inside walprop_shared->feedback (WalproposerShmemState) hasn't been initialized, which is the reason for the big LSN lag,
and the big lag prevents the compute node from sending any WAL.
The pageserver ignores any new connection candidate because its commit LSN is no greater than the pageserver's last record LSN:
self.wal_stream_candidates
    .iter()
    .filter(|(_, etcd_info)| {
        etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())
    })
    ...
No new connection means no ReplicationFeedback is sent to the compute node to initialize the correct write lag. No replication feedback to the compute node means no new WAL (unchanged commit LSN). These things create a dependency cycle/a deadlock, but:
Why does this work normally after 15s?
Probably, the compute node has some timeout or background job that sends WAL regardless of the lag value. Notice that in the pg log, after 15 seconds of inactivity, the compute sent new WAL of len=184 despite the current large write lag:
2022-07-04 11:01:14.324 GMT [25251] DEBUG: backpressure throttling: lag 25166456
2022-07-04 11:01:14.334 GMT [25251] DEBUG: current flushLsn 0/2700278 ReplicationFeedback: write 0/0 flush 0/0 apply 0/0
2022-07-04 11:01:14.334 GMT [25251] DEBUG: backpressure throttling: lag 25166456
2022-07-04 11:01:14.335 GMT [25144] DEBUG: sending message len 184 beginLsn=0/2700278 endLsn=0/2700330 commitLsn=0/2700278 truncateLsn=0/2700278 to localhost:15005
...
The SK receives the WAL and updates the commit LSN -> the pageserver creates a new connection and sends a ReplicationFeedback -> the compute node updates the write lag values.
One way to fix this is to modify WalreceiverState::applicable_connection_candidates to accept a new candidate with etcd_info.timeline.commit_lsn >= Some(self.local_timeline.get_last_record_lsn()) if the pageserver has no connection, and otherwise keep using etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn()).
@SomeoneToIgnore WDYT?
That's a great finding, thank you for looking into that. TBH, this solution feels like a stub to me, fixing the issue in a non-obvious way, but I have no better options myself, so it might be OK to implement it this way, though more comments and unit tests should be added. @LizardWizzard could propose something better, I believe 🙂
Why does this work normally after 15s? Probably, the compute node has some timeout or background job that sends WAL regardless of the lag value.
I dug "a little bit deeper" into this and found that the background job seems to be the background writer's LogStandbySnapshot:
if (XLogStandbyInfoActive() && !RecoveryInProgress())
{
    TimestampTz timeout = 0;
    TimestampTz now = GetCurrentTimestamp();

    // LOG_SNAPSHOT_INTERVAL_MS = 15s
    timeout = TimestampTzPlusMilliseconds(last_snapshot_ts, LOG_SNAPSHOT_INTERVAL_MS);

    /*
     * Only log if enough time has passed and interesting records have
     * been inserted since the last snapshot. Have to compare with <=
     * instead of < because GetLastImportantRecPtr() points at the
     * start of a record, whereas last_snapshot_lsn points just past
     * the end of the record.
     */
    if (now >= timeout &&
        last_snapshot_lsn <= GetLastImportantRecPtr())
    {
        last_snapshot_lsn = LogStandbySnapshot();
        last_snapshot_ts = now;
    }
}
And the behaviour described in this issue can be easily replicated with the below test:
def test_restart_delay(neon_simple_env: NeonEnv):
    env = neon_simple_env
    env.neon_cli.create_branch('test')
    pg = env.postgres.create('test')
    # pg.config(['log_min_messages=DEBUG2'])
    pg.start()

    # generate enough WAL to exceed the maximum write lag
    pg.safe_psql_many(queries=[
        'CREATE TABLE foo(key int primary key)',
        'INSERT INTO foo SELECT generate_series(1, 100000)',
    ])

    log.info("sleep 15s to trigger the background writer")
    time.sleep(15)

    env.pageserver.stop()
    env.safekeepers[0].stop()
    env.safekeepers[0].start()
    env.pageserver.start()

    timer = timeit.default_timer()
    pg.safe_psql('INSERT INTO foo SELECT generate_series(100001, 100100)')
    duration = timeit.default_timer() - timer

    # inserting 100 rows should not take too much time
    assert duration < 2.0
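Assuming the test is placed next to test_wal_generate.py (and has that module's usual imports for time, timeit, log, and NeonEnv), it can be run the same way as the other tests above, e.g. poetry run pytest -k 'test_restart_delay'.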
Hmm. Two additional questions bother me:
- Is it correct that there is a part of the compute that sends WAL to Safekeepers ignoring backpressure?
- Why does this startup issue only happen with XLOG_SWITCH? I suspect it has something to do with the lack of a partial segment.
- Is it correct that there is a part of the compute that sends WAL to Safekeepers ignoring backpressure?
Yes. The WAL sender isn't affected by backpressure: it sends a new WAL record if there is one. The part affected by backpressure is probably the WAL-generating code. I'm not sure about this; it's just my observation from adding logs to the WalSndLoop function. Can someone confirm this behaviour? cc @knizhnik, @hlinnaka.
- Why does this startup issue only happen with XLOG_SWITCH? I suspect it has something to do with the lack of a partial segment.
I'm not too familiar with XLOG_SWITCH, so I don't really understand the question. This startup issue happens consistently if we restart the pageserver right before the background writer runs; the time for the first write request is then about 15s (= the period of the background writer).
This startup issue happens consistently
Hmm, if I try RUST_LOG=debug poetry run pytest -k 'test_wal_generate[simple]', I do not see such behavior, do you? The difference is the shape of WAL generated.
This startup issue happens consistently
Hmm, if I try RUST_LOG=debug poetry run pytest -k 'test_wal_generate[simple]', I do not see such behavior, do you? The difference is the shape of WAL generated.
Oh I was referring to the new test I added above.
The difference is the shape of WAL generated.
One requirement for the startup issue is that a specific amount of WAL (> 15MB) needs to be accumulated before the restart to trigger the backpressure on startup. I guess that's not the case for the "[simple]" test.
Is it correct that there is a part of the compute that sends WAL to Safekeepers ignoring backpressure?
Backpressure blocks write transactions. The walsender is not blocked. Autovacuum can set hint bits outside a transaction; in this case, with wal_log_hints=on, autovacuum can produce a lot of WAL and is not throttled by backpressure. wal_log_hints is set to true by default, but in prod/staging it is set to false (I wonder if we should also change the default value).
@aome510 thanks for the investigation.
One way to fix this is to modify WalreceiverState::applicable_connection_candidates
That would leave the connection open even if there is nothing to do (currently it still works mostly the same, but this is not an excuse). The better way would be for safekeepers to announce the existence of live compute(s) along with commit_lsn, and for the pageserver to connect if there is a compute, regardless of its caught-up-ness.
To recap, the problem is that while there is no new WAL for the pageserver, it doesn't connect to safekeepers -- and this deadlocks with backpressure kicking in while the pageserver's position is unknown. LogStandbySnapshot is not limited by backpressure and resolves it after some time, as it is logged every 15 seconds. Also, if the pageserver is not caught up after start, the problem disappears as well, since it connects anyway, obviously.
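A rough sketch of that idea, with hypothetical names (SkTimelineInfo and num_computes here are assumptions, not the actual broker schema): the safekeeper publishes how many compute connections it has, and the pageserver treats any safekeeper with a live compute as a connection candidate, even when it is fully caught up.

// Hypothetical sketch only; field and type names are assumptions.
#[derive(Clone, Copy, PartialEq, PartialOrd)]
struct Lsn(u64);

struct SkTimelineInfo {
    commit_lsn: Option<Lsn>,
    // Proposed addition: number of compute (walproposer) connections the
    // safekeeper currently has for this timeline.
    num_computes: u32,
}

fn is_connection_candidate(info: &SkTimelineInfo, last_record_lsn: Lsn) -> bool {
    // Strictly newer WAL is always interesting, as today...
    info.commit_lsn > Some(last_record_lsn)
        // ...and with a live compute attached, connect regardless of
        // caught-up-ness, so feedback flows and backpressure clears.
        || info.num_computes > 0
}

fn main() {
    let caught_up = SkTimelineInfo { commit_lsn: Some(Lsn(0x2700278)), num_computes: 1 };
    // A caught-up safekeeper is still a candidate as long as a compute is attached.
    assert!(is_connection_candidate(&caught_up, Lsn(0x2700278)));
}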
I've created a separate branch with @aome510's test here: https://github.com/neondatabase/neon/commits/test-restart-delay (I hope you don't mind that I've put you as the commit author).
@arssher has suggested that the time.sleep(15) may be necessary not because we want to trigger the background writer before restarting, but because we want to sync the pageserver with the compute so that it sees no reason to connect to the safekeeper (and, consequently, update the backpressure information) after the restart.
For example, if I time.sleep(5), the test passes erroneously, but if I time.sleep(9), it breaks correctly on my machine.
So I altered the test to wait for the pageserver instead of sleeping 15 seconds. It fails consistently now.
One way to fix this is to modify WalreceiverState::applicable_connection_candidates
That would leave the connection open even if there is nothing to do (currently it still works mostly the same, but this is not an excuse). The better way would be for safekeepers to announce the existence of live compute(s) along with commit_lsn, and for the pageserver to connect if there is a compute, regardless of its caught-up-ness.
Is there anyone actively working on implementing this? I'm not too familiar with the safekeeper code, but I can try if needed. I also found that we already have a reliable test for this issue in https://github.com/neondatabase/neon/commits/test-restart-delay.
cc @yeputons @arssher
Is there anyone actively working on implementing this? I'm not too familiar with the safekeeper code, but I can try if needed.
@aome510 Yes, please. In case of questions, feel free to reach out.
As far as I understand, it is the same problem which I have fixed in https://github.com/neondatabase/postgres/pull/187
In short: the backpressure lag is incorrectly calculated because of mixing InvalidXLogRecPtr=0 and UnknownXLogRecPtr=~0.
Unfortunately, there is still no comment or review on this PR.
Maybe it is better to eliminate UnknownXLogRecPtr altogether; it was introduced to make the calculation of the minimum LSN among replicas/safekeepers more straightforward.
As far as I understand, it is the same problem which I have fixed in https://github.com/neondatabase/postgres/pull/187
It looks like all three of your variants allow writes (disable backpressure) while the position of the pageserver is unknown. This is a bit dangerous, as it makes it possible to ignore backpressure early during startup, though the window is small and perhaps this is OK. What I proposed instead is to establish the safekeeper <-> pageserver connection ASAP after a connection from the compute appears, by publishing knowledge about compute connections on safekeepers. Even if we don't need this for backpressure, there is one more reason I wanted to add it anyway: if sync-safekeepers waits only for the catchup of a majority of safekeepers (and it should), then the pageserver must be able to immediately switch to one of the safekeepers in this majority to provide a basebackup faster. Knowing the number of compute connections allows doing this without adding the risk of flip-flopping during normal work (if there are compute connections, the pageserver shouldn't immediately prefer a more advanced safekeeper).
@yeputons Could you please try again on the current main? After the https://github.com/neondatabase/postgres/pull/187 merge, it should be gone.
Gone indeed. poetry run pytest -k 'test_crafted_wal_end' runs each of the five tests in about 5 seconds now.
Also, edf4b3ee3cf6350b70f797a906c25733866a1791 (the test from https://github.com/neondatabase/neon/issues/2023#issuecomment-1179351820) works.