test_import_from_pageserver_multisegment consistently failed
test_import_from_pageserver_multisegment was added in https://github.com/neondatabase/neon/pull/2172. The test passed in the PR, but after merging into main, it failed consistently with the error:
2022-08-12T04:16:02.4390590Z E Exception: Run ['/tmp/neon/bin/neon_local', 'pageserver', 'stop'] failed:
2022-08-12T04:16:02.4391127Z E stdout: Stopping pageserver gracefully..............................................................
2022-08-12T04:16:02.4391556Z E stderr:
2022-08-12T04:16:02.4392003Z E Pageserver connection failed with error: Cannot assign requested address (os error 99)
2022-08-12T04:16:02.4392544Z E pageserver stop failed: Failed to stop pageserver with pid 6030
Example failed run: https://github.com/neondatabase/neon/runs/7800417580?check_suite_focus=true
Update: test_import_from_pageserver_multisegment is disabled in https://github.com/neondatabase/neon/pull/2258. One of the requirements for this PR is to investigate the cause of the failure and re-enable the test.
My theory: when the pageserver receives SIGTERM, it starts the shutdown sequence, but it doesn't immediately kill GC and/or compaction. They continue to run, and after a large import like the one in this test, they can take a long time to finish. The shutdown timeout in our tests is 60 s; if the pageserver doesn't exit within 60 s of receiving SIGTERM, the test fails.
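To make the theory concrete, here is a minimal sketch of the failure mode, assuming a POSIX system. It is not the pageserver or neon_local code: the child script, `stop_gracefully`, and the timings are all hypothetical stand-ins. The child starts shutting down on SIGTERM but first lets a "background job" run to completion, and the parent gives up after a timeout, much like the test harness would.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for the pageserver: on SIGTERM it begins shutdown,
# but the "background job" (think GC/compaction) is not cancelled first.
CHILD = r"""
import signal, sys, time
job_seconds = float(sys.argv[1])
def on_term(signum, frame):
    time.sleep(job_seconds)  # background work runs to completion
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.05)
"""

def stop_gracefully(job_seconds, timeout):
    """Send SIGTERM and report whether the child exited within `timeout` seconds."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(job_seconds)],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # wait until the SIGTERM handler is installed
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # give up on the graceful stop, as a test harness would
        proc.wait()
        return False
    finally:
        proc.stdout.close()
```

With a short job the stop succeeds; with a job longer than the timeout it fails, even though the process did react to SIGTERM right away.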
The error message with Cannot assign requested address (os error 99) is weird though. I'm not sure why that happens.
@bojanserafimov can you take a look at this, after the daemonize issue, please? PR #2261 will probably at least change the error message from this test, as it changes the way we wait for the pageserver shutdown.
> My theory: when the pageserver receives SIGTERM, it starts the shutdown sequence, but it doesn't immediately kill GC and/or compaction.
This is not just a theory; it can be triggered very simply, as I mentioned in the RFC.
I'm not sure that's really what happens here, but the libpq do_gc and checkpoint calls (the latter forces a compaction) don't hold file_lock
https://github.com/neondatabase/neon/blob/e593cbaabafafb6f54c9ddcd5d1d9f04d1bd4490/pageserver/src/layered_repository.rs#L83-L93
that's used as a semaphore to wait for regular, spawned gc and compaction tasks to stop:
https://github.com/neondatabase/neon/blob/84d1bc06a93de64488a6da7b06c471a9505076e8/pageserver/src/tenant_mgr.rs#L363-L365
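A toy illustration of why that matters, not the actual pageserver code: `Gate`, `spawned_task`, and `on_demand_task` are hypothetical names, and the real file_lock is a Rust lock rather than this counter. The shape is the same, though. Shutdown waits only for the tasks that hold the lock, so a task that never takes it keeps running after the wait returns.

```python
import threading
import time

class Gate:
    """Toy stand-in for a lock used as a semaphore: tasks hold it while running."""

    def __init__(self):
        self._cond = threading.Condition()
        self._holders = 0

    def __enter__(self):
        with self._cond:
            self._holders += 1
        return self

    def __exit__(self, *exc):
        with self._cond:
            self._holders -= 1
            self._cond.notify_all()

    def wait_idle(self, timeout):
        """Block until no task holds the gate; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._holders == 0, timeout=timeout)

def demo():
    gate = Gate()
    started = threading.Barrier(3)  # both tasks plus the "shutdown" thread

    def spawned_task():
        # Like the regular background GC/compaction: holds the gate while working.
        with gate:
            started.wait()
            time.sleep(0.1)

    def on_demand_task():
        # Like a libpq-triggered do_gc/checkpoint: never takes the gate.
        started.wait()
        time.sleep(1.0)

    a = threading.Thread(target=spawned_task)
    b = threading.Thread(target=on_demand_task)
    a.start()
    b.start()
    started.wait()

    idle = gate.wait_idle(timeout=0.5)  # shutdown believes all work has stopped
    still_running = b.is_alive()        # but the on-demand task is still going
    a.join()
    b.join()
    return idle, still_running
```

Here `demo()` reports that the gate went idle while the on-demand task was still alive, which is the situation described above: the wait on the lock cannot account for work that never acquired it.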