Timeout waiting for WAL to arrive in `test_wal_restore`
https://app.circleci.com/pipelines/github/neondatabase/neon/6511/workflows/ab69d61e-09cf-46a4-b603-2aa7d3e19b96/jobs/65792/steps:
2022-05-19T07:47:12.261111Z ERROR pagestream{timeline=2d7fa6256c048af93477fc0d1a6a8839 tenant=b5c7213e03ad45b596d2238fd0e67abd}: error reading relation or page version: Timed out while waiting for WAL record at LSN 0/28E36D0 to arrive, last_record_lsn 0/25F2E10 disk consistent LSN=0/1696628
...
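To gauge how far behind the pageserver was when it gave up, the two LSNs in the error message can be compared. This is a minimal sketch, assuming the standard PostgreSQL `hi/lo` hexadecimal LSN notation and the message wording seen in the log above; `parse_lsn`, `wal_gap` and `LOG_RE` are hypothetical helper names, not part of the neon codebase.

```python
import re
from typing import Optional

# Matches the wording of the pageserver error quoted in this issue.
# The pattern is an assumption based on that one message format.
LOG_RE = re.compile(
    r"at LSN (?P<wanted>[0-9A-Fa-f]+/[0-9A-Fa-f]+) to arrive, "
    r"last_record_lsn (?P<last>[0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)


def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL-style LSN such as '0/28E36D0' to a byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_gap(line: str) -> Optional[int]:
    """Return how many bytes of WAL were still missing, or None if no match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return parse_lsn(m["wanted"]) - parse_lsn(m["last"])


if __name__ == "__main__":
    line = (
        "Timed out while waiting for WAL record at LSN 0/28E36D0 to arrive, "
        "last_record_lsn 0/25F2E10 disk consistent LSN=0/1696628"
    )
    print(wal_gap(line))
```

A gap of a few megabytes suggests the pageserver was still receiving WAL but had not caught up before the wait timed out; it does not by itself show which component stalled.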
Here's another case where that happened:
https://app.circleci.com/pipelines/github/neondatabase/neon/6484/workflows/bcf1d2d0-d4e0-4425-ae17-b9c940ffacef/jobs/65419/tests
The test seems to be flaky. Let's investigate why.
Another one from me: https://app.circleci.com/pipelines/github/neondatabase/neon/6586/workflows/e88e8a8c-1b7b-49ff-8161-3eca2c3a84a2/jobs/66690
Looks like the earliest weird event is in the compute log:
2022-05-21 01:12:32.543 GMT [14562] FATAL: canceling authentication due to timeout
Safekeeper goes down immediately after:
2022-05-21T01:12:33.433775Z ERROR {tid=12}: query handler for 'START_WAL_PUSH postgresql://no_user:@localhost:16104' failed: failed to run ReceiveWalConn
followed by pageserver:
2022-05-21T01:12:32.545891Z ERROR pagestream{timeline=42a88bceaf8f85a51a2b2affb63883bf tenant=f3789a9c1e4243e18224a22b6cb2f32f}: error reading relation or page version: Timed out while waiting for WAL record at LSN 0/28E36D0 to arrive, last_record_lsn 0/2754468 disk consistent LSN=0/1696628
and the compute itself:
2022-05-21 01:12:33.431 GMT [14364] LOG: received fast shutdown request
Could this be related to #1068? That issue required no safekeepers, though, while this test uses a safekeeper.
UPD: there was a very similar failure in a restarted CI run.
Haven't seen this for a while.