Timeout waiting for WAL to arrive in `test_wal_restore`

Open hlinnaka opened this issue 2 years ago • 2 comments

https://app.circleci.com/pipelines/github/neondatabase/neon/6511/workflows/ab69d61e-09cf-46a4-b603-2aa7d3e19b96/jobs/65792/steps:

2022-05-19T07:47:12.261111Z ERROR pagestream{timeline=2d7fa6256c048af93477fc0d1a6a8839 tenant=b5c7213e03ad45b596d2238fd0e67abd}: error reading relation or page version: Timed out while waiting for WAL record at LSN 0/28E36D0 to arrive, last_record_lsn 0/25F2E10 disk consistent LSN=0/1696628
...
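To get a feel for how far behind the pageserver was, the LSNs in that error line can be turned into byte positions. A minimal sketch, assuming the standard PostgreSQL `hi/lo` hexadecimal LSN notation; `parse_lsn` is a hypothetical helper written for this illustration, not something from the neon code base:

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/28E36D0' into an absolute byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


# Values copied from the pagestream error above.
requested = parse_lsn("0/28E36D0")
last_record = parse_lsn("0/25F2E10")
disk_consistent = parse_lsn("0/1696628")

# WAL that the page request needed but the pageserver had not yet received.
missing_bytes = requested - last_record
print(f"missing WAL: {missing_bytes} bytes ({missing_bytes / 2**20:.2f} MiB)")
```

By this arithmetic the request was waiting on a bit under 3 MiB of WAL that had not arrived when the timeout fired.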

May 19 '22 09:05 hlinnaka

Here's another case where that happened:

https://app.circleci.com/pipelines/github/neondatabase/neon/6484/workflows/bcf1d2d0-d4e0-4425-ae17-b9c940ffacef/jobs/65419/tests

The test seems to be flaky. Let's investigate why.

May 19 '22 09:05 hlinnaka

Another one from me: https://app.circleci.com/pipelines/github/neondatabase/neon/6586/workflows/e88e8a8c-1b7b-49ff-8161-3eca2c3a84a2/jobs/66690

Looks like the earliest weird event is in the compute log:

2022-05-21 01:12:32.543 GMT [14562] FATAL:  canceling authentication due to timeout

Safekeeper goes down immediately after:

2022-05-21T01:12:33.433775Z ERROR {tid=12}: query handler for 'START_WAL_PUSH postgresql://no_user:@localhost:16104' failed: failed to run ReceiveWalConn

followed by pageserver:

2022-05-21T01:12:32.545891Z ERROR pagestream{timeline=42a88bceaf8f85a51a2b2affb63883bf tenant=f3789a9c1e4243e18224a22b6cb2f32f}: error reading relation or page version: Timed out while waiting for WAL record at LSN 0/28E36D0 to arrive, last_record_lsn 0/2754468 disk consistent LSN=0/1696628

and the compute itself:

2022-05-21 01:12:33.431 GMT [14364] LOG:  received fast shutdown request
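The four log lines come from three different processes, so it may help to sort them by timestamp before reading too much into the order. A rough sketch, assuming all clocks are comparable (the test runs everything on localhost) and that the compute log's GMT timestamps and the `Z`-suffixed ones are both UTC; the event labels are my own shorthand:

```python
from datetime import datetime

# (label, timestamp) pairs copied from the log excerpts above.
events = [
    ("compute: canceling authentication due to timeout", "2022-05-21 01:12:32.543"),
    ("safekeeper: START_WAL_PUSH query handler failed", "2022-05-21T01:12:33.433775Z"),
    ("pageserver: timed out waiting for WAL record", "2022-05-21T01:12:32.545891Z"),
    ("compute: received fast shutdown request", "2022-05-21 01:12:33.431"),
]


def parse_ts(raw: str) -> datetime:
    """Normalize both timestamp styles to a naive UTC datetime."""
    cleaned = raw.replace("T", " ").rstrip("Z")
    return datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S.%f")


ordered = sorted(events, key=lambda event: parse_ts(event[1]))
for label, raw in ordered:
    print(parse_ts(raw).time(), label)
```

Sorted this way, the pageserver timeout lands about 3 ms after the authentication timeout, and the safekeeper error about 3 ms after the compute's shutdown request.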

Could this be related to #1068? That issue required no safekeepers, though, while this test uses a safekeeper.

UPD: a very similar failure showed up in a restarted CI run.

May 21 '22 01:05 yeputons

Haven't seen this for a while.

Dec 27 '22 15:12 arssher
