
test_import_from_pageserver_multisegment consistently failed

Open aome510 opened this issue 2 years ago • 3 comments

test_import_from_pageserver_multisegment was added in https://github.com/neondatabase/neon/pull/2172. The test passed in the PR, but after merging into main, it failed consistently with the error:

2022-08-12T04:16:02.4390590Z E   Exception:             Run ['/tmp/neon/bin/neon_local', 'pageserver', 'stop'] failed:
2022-08-12T04:16:02.4391127Z E                 stdout: Stopping pageserver gracefully..............................................................
2022-08-12T04:16:02.4391556Z E                 stderr: 
2022-08-12T04:16:02.4392003Z E   Pageserver connection failed with error: Cannot assign requested address (os error 99)
2022-08-12T04:16:02.4392544Z E   pageserver stop failed: Failed to stop pageserver with pid 6030

Example failed run: https://github.com/neondatabase/neon/runs/7800417580?check_suite_focus=true

Update: test_import_from_pageserver_multisegment is disabled in https://github.com/neondatabase/neon/pull/2258. Closing this issue requires investigating the cause of the failure and re-enabling the test.

aome510 avatar Aug 12 '22 05:08 aome510

My theory: when the pageserver receives SIGTERM, it starts the shutdown sequence, but it doesn't immediately kill GC and/or compaction. They continue to run, and after a large import like the one in this test, they can take a long time to finish. The shutdown timeout in our tests is 60 s; if the pageserver doesn't exit within 60 s of receiving SIGTERM, the test fails.

The error message with Cannot assign requested address (os error 99) is weird though. I'm not sure why that happens.
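As a rough illustration of the timeout theory above (not the actual neon_local or test-harness code), here is a minimal Python sketch of "send SIGTERM, then wait up to a deadline". The `CHILD` script, `spawn`, and `stop_gracefully` are hypothetical names made up for this example; the child merely pretends that background work delays its exit. It assumes a Unix-like system.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a server whose background work delays shutdown:
# on SIGTERM it keeps "working" for `delay` seconds before exiting.
CHILD = """
import signal, sys, time
delay = float(sys.argv[1])
def on_term(signum, frame):
    time.sleep(delay)  # pretend GC/compaction is still finishing
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.05)
"""


def spawn(delay: float) -> subprocess.Popen:
    """Start the stand-in child and wait until its SIGTERM handler is installed."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(delay)],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # blocks until the child prints "ready"
    return proc


def stop_gracefully(proc: subprocess.Popen, timeout_s: float) -> bool:
    """Send SIGTERM and wait up to timeout_s; return True if the process exited."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        return False
```

If the child's post-SIGTERM work outlasts `timeout_s`, `stop_gracefully` returns `False`, which is the kind of outcome that would surface as a failed stop in a test.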

hlinnaka avatar Aug 12 '22 18:08 hlinnaka

@bojanserafimov can you take a look at this, after the daemonize issue, please? PR #2261 will probably at least change the error message from this test, as it changes the way we wait for the pageserver shutdown.

hlinnaka avatar Aug 12 '22 18:08 hlinnaka

> My theory: when the pageserver receives SIGTERM, it starts the shutdown sequence, but it doesn't immediately kill GC and/or compaction.

This is not just a theory: it can be triggered very easily, as I've mentioned in the RFC.

I'm not sure that's really the case here, though. The libpq do_gc and checkpoint calls (the latter forces a compaction) don't hold file_lock: https://github.com/neondatabase/neon/blob/e593cbaabafafb6f54c9ddcd5d1d9f04d1bd4490/pageserver/src/layered_repository.rs#L83-L93

That lock is used as a semaphore to wait for the regular, spawned gc and compaction tasks to stop:

https://github.com/neondatabase/neon/blob/84d1bc06a93de64488a6da7b06c471a9505076e8/pageserver/src/tenant_mgr.rs#L363-L365
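
To make the concern above concrete, here is a toy Python model of a lock used as a "wait for tasks to stop" semaphore. It is only a sketch under assumptions: the real code is Rust and its lock type and task structure are not reproduced here; `file_lock`, `background_task`, `manual_task`, and `shutdown_can_proceed` are hypothetical stand-ins.

```python
import threading

# Hypothetical stand-in for the lock that shutdown uses to wait for tasks.
file_lock = threading.Lock()


def background_task(started: threading.Event, release: threading.Event) -> None:
    # Regular spawned task: holds the lock for the duration of its work.
    with file_lock:
        started.set()
        release.wait()


def manual_task(started: threading.Event, release: threading.Event) -> None:
    # Manually triggered task: does its work without taking the lock.
    started.set()
    release.wait()


def shutdown_can_proceed(timeout_s: float) -> bool:
    """Return True if shutdown acquired the lock, i.e. it believes all tasks stopped."""
    acquired = file_lock.acquire(timeout=timeout_s)
    if acquired:
        file_lock.release()
    return acquired
```

In this model, shutdown blocks while `background_task` runs, but it proceeds immediately while `manual_task` is still running, because that task never took the lock. Shutdown would then believe everything has stopped while work is still in flight.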

SomeoneToIgnore avatar Aug 12 '22 19:08 SomeoneToIgnore