test_wal_acceptor_async.py::test_concurrent_computes is flaky

Open petuhovskiy opened this issue 2 years ago • 0 comments

https://github.com/neondatabase/neon/runs/7834746692?check_suite_focus=true

Found this in postgres logs:

2022-08-15 08:46:34.261 GMT [3052] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:18239'
2022-08-15 08:46:34.454 GMT [3052] LOG:  execute __asyncpg_stmt_2731__: INSERT INTO query_log(index, verify_key) VALUES (6, 393221) RETURNING verify_key
2022-08-15 08:46:35.166 GMT [3037] LOG:  connecting with node localhost:18241
2022-08-15 08:46:35.166 GMT [3037] LOG:  connecting with node localhost:18243
2022-08-15 08:46:35.167 GMT [3037] LOG:  connecting with node localhost:18245
2022-08-15 08:46:35.167 GMT [3037] LOG:  connected with node localhost:18241
2022-08-15 08:46:35.168 GMT [3037] FATAL:  WAL acceptor localhost:18241 with term 108 rejects our connection request with term 102
2022-08-15 08:46:35.169 GMT [3030] LOG:  background worker "WAL proposer" (PID 3037) exited with exit code 1
2022-08-15 08:46:40.174 GMT [3578] LOG:  connecting with node localhost:18241
2022-08-15 08:46:40.175 GMT [3578] LOG:  connecting with node localhost:18243
2022-08-15 08:46:40.175 GMT [3578] LOG:  connecting with node localhost:18245

Here the safekeeper reported a higher term (108) than ours (102), which means that at least one other compute is running concurrently. After that, we log a FATAL message and restart the walproposer process.

This is probably not how it should work: instead, we should stop all postgres processes and get a new basebackup.
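When triaging other failed runs, it may help to scan the postgres logs for the same rejection. The following is a minimal sketch, assuming the FATAL message always has the format shown in the excerpt above; the helper name `find_term_rejections` is hypothetical and not part of the neon test suite.

```python
import re

# Matches the FATAL line format seen in the log excerpt; the format of
# other walproposer messages is not assumed here.
REJECT_RE = re.compile(
    r"WAL acceptor (?P<node>\S+) with term (?P<acceptor_term>\d+) "
    r"rejects our connection request with term (?P<our_term>\d+)"
)


def find_term_rejections(log_text):
    """Return (node, acceptor_term, our_term) for each rejection line."""
    found = []
    for line in log_text.splitlines():
        m = REJECT_RE.search(line)
        if m:
            found.append(
                (m["node"], int(m["acceptor_term"]), int(m["our_term"]))
            )
    return found


# The FATAL line from the log excerpt above.
sample = (
    "2022-08-15 08:46:35.168 GMT [3037] FATAL:  WAL acceptor localhost:18241 "
    "with term 108 rejects our connection request with term 102"
)

for node, acceptor_term, our_term in find_term_rejections(sample):
    # A higher acceptor term suggests another compute has been elected since.
    print(f"{node}: acceptor term {acceptor_term}, our term {our_term}")
```

A non-empty result on a failed run would suggest the same concurrent-compute condition described here, rather than an unrelated failure.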

petuhovskiy, Aug 15 '22 09:08