test_timeline_deletion_with_files_stuck_in_upload_queue flakiness
Rare:
AssertionError: assert not [
(762, '2024-02-23T01:54:42.414003Z WARN delete_timeline{tenant_id=faa1e715b82ea028c2ab77c827a4e253 shard_id=0000 timeline_id=f8b5e0a4e8c75657837989e9d700addb}: got not found err while removing timeline dir, proceeding anyway timeline_dir="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb" path="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000171F4C1-000000000172BF51"\n'),
(763, '2024-02-23T01:54:42.436209Z WARN delete_timeline{tenant_id=faa1e715b82ea028c2ab77c827a4e253 shard_id=0000 timeline_id=f8b5e0a4e8c75657837989e9d700addb}: got not found err while removing timeline dir, proceeding anyway timeline_dir="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb" path="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696321-00000000016A5D69"\n'),
(764, '2024-02-23T01:54:42.445635Z WARN delete_timeline{tenant_id=faa1e715b82ea028c2ab77c827a4e253 shard_id=0000 timeline_id=f8b5e0a4e8c75657837989e9d700addb}: got not found err while removing timeline dir, proceeding anyway timeline_dir="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb" path="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000175F0B9-000000000176A2F1"\n')]
Unsure what that could be. Only example I've seen so far. Analysis in: https://github.com/neondatabase/neon/issues/6681#issuecomment-1961568118
Common:
AssertionError: assert not [
(866, '2024-02-22T17:26:05.355705Z ERROR request{method=PUT path=/v1/tenant/01f38f79d16faf4f1caa45fe1fbed6da/timeline/b2d0a3673159e19a88e51f564cfac803/checkpoint request_id=3e7ffcba-0384-4464-878e-3593c876cb94}: Error processing HTTP request: InternalServerError(queue is in state Stopping\n')]
This sounds like an error message which was changed. Fixed in #6894.
Looks like a Stopping/Stopped string mismatch, which I introduced in https://github.com/neondatabase/neon/commit/8dee9908f83fdebea1dfd36304272bdbe684ad5c; I cannot see why I changed it. I'll just switch it back.
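To illustrate why a one-word change in an error message is enough to fail the test: the assertion above fails whenever a WARN or ERROR log line matches none of the tolerated patterns. A minimal sketch of that kind of check, assuming a hypothetical allow-list of regexes (`ALLOWED_ERRORS` and `unexpected_errors` are illustrative names, not the neon test fixtures' API), assuming the allow-list expected the `Stopped` spelling, and using abbreviated sample log lines:

```python
import re

# Hypothetical allow-list: regexes for WARN/ERROR lines the test tolerates.
# Assumption: the test expected the "Stopped" spelling of the message.
ALLOWED_ERRORS = [
    r"queue is in state Stopped",
]


def unexpected_errors(log_lines, allowed=ALLOWED_ERRORS):
    """Return (line_number, line) for each WARN/ERROR line matching no allowed pattern."""
    compiled = [re.compile(pattern) for pattern in allowed]
    found = []
    for lineno, line in enumerate(log_lines, start=1):
        # Only WARN and ERROR lines are subject to the check.
        if " ERROR " not in line and " WARN " not in line:
            continue
        # A line is tolerated if any allowed pattern occurs in it.
        if any(p.search(line) for p in compiled):
            continue
        found.append((lineno, line))
    return found


# Abbreviated sample lines, modeled on the failure quoted above.
sample_log = [
    "2024-02-22T17:26:05.1Z INFO request{method=PUT}: handling request\n",
    "2024-02-22T17:26:05.3Z ERROR request{method=PUT}: InternalServerError(queue is in state Stopping\n",
    "2024-02-22T17:26:05.4Z ERROR request{method=PUT}: InternalServerError(queue is in state Stopped\n",
]

leftover = unexpected_errors(sample_log)
# The test would then do the equivalent of: assert not leftover
```

Here only the `Stopping` line is left over, so the assertion fails even though the underlying situation (a request hitting a queue that is shutting down) is the one the allow-list was meant to tolerate.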
The rarer case is a valid situation: it happens when struct Layer drops run at the same time as timeline deletion is cleaning up local layer files. Perhaps the walkdir now has the upper hand because it is sync code, whereas struct Layer uses spawn_blocking, but either way it is a race.
The individual layers have no knowledge of deletion happening and they were being kept alive by UploadTask entries in RemoteTimelineClient. I think the answer to this is to hold the gate a bit more, and be sure to hold the guard until the end of deletion.
Something to consider after #6028.
Next steps:
- Layers need to hold a single gateguard (not necessarily a gateguard per layer) so that we can synchronize the shutdown
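The gate idea discussed above can be sketched as follows. This is not the pageserver's Rust implementation; it is a small Python model using `threading`, and every name in it (`Gate`, `GateClosed`, `drop_layer`) is hypothetical. It only shows the ordering guarantee: deletion closes the gate and waits until every outstanding guard is released before it removes the directory, so layer cleanup can no longer race with it.

```python
import threading
import time


class GateClosed(Exception):
    """Raised when entering a gate that is closing or closed."""


class Gate:
    def __init__(self):
        self._cond = threading.Condition()
        self._guards = 0
        self._closing = False

    def enter(self):
        """Take a guard; fails once close() has been called."""
        with self._cond:
            if self._closing:
                raise GateClosed()
            self._guards += 1
        return _Guard(self)

    def _release(self):
        with self._cond:
            self._guards -= 1
            if self._guards == 0:
                self._cond.notify_all()

    def close(self):
        """Refuse new guards, then block until all existing guards are released."""
        with self._cond:
            self._closing = True
            while self._guards > 0:
                self._cond.wait()


class _Guard:
    def __init__(self, gate):
        self._gate = gate
        self._released = False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.release()
        return False

    def release(self):
        # Idempotent, so a guard can be released explicitly or via "with".
        if not self._released:
            self._released = True
            self._gate._release()


events = []
gate = Gate()
guard = gate.enter()  # taken before deletion starts, e.g. on behalf of the layers


def drop_layer(held_guard):
    # Stand-in for a layer drop that removes its local file in the background.
    with held_guard:
        time.sleep(0.05)
        events.append("layer file removed")


worker = threading.Thread(target=drop_layer, args=(guard,))
worker.start()

gate.close()  # deletion waits here until the guard is released
events.append("timeline dir removed")
worker.join()
```

With the guard held until the layer cleanup finishes, the directory removal always runs second, and any later attempt to enter the gate fails instead of starting new work during deletion.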
The rare case is likely handled by #7082.
A recent failure: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6407/8295888842/index.html#suites/837740b64a53e769572c4ed7b7a7eeeb/fbe892c34b63cdb
I think it was just being slow; the pageserver logs do not appear to show any long stalls. But this may have been with the recently added assertions.
This has not been flaky since Timeline::gate usage was introduced in #7082: the last flaky run was 2024-03-16. However, that work was merged before that date, so I am unsure whether it is what fixed it.