test_timeline_physical_size_post_compaction is flaky
2022-08-03T14:59:43.4397932Z _________________ test_timeline_physical_size_post_compaction __________________
2022-08-03T14:59:43.4401373Z [gw0] linux -- Python 3.9.2 /github/home/.cache/pypoetry/virtualenvs/zenith-_pxWMzVK-py3.9/bin/python
2022-08-03T14:59:43.4402042Z test_runner/batch_others/test_timeline_size.py:237: in test_timeline_physical_size_post_compaction
2022-08-03T14:59:43.4402561Z assert_physical_size(env, env.initial_tenant, new_timeline_id)
2022-08-03T14:59:43.4403036Z test_runner/batch_others/test_timeline_size.py:338: in assert_physical_size
2022-08-03T14:59:43.4403502Z assert res["local"]["current_physical_size"] == res["local"][
2022-08-03T14:59:43.4403814Z E assert 42999808 == 45793280
2022-08-03T14:59:43.4404257Z E +42999808
2022-08-03T14:59:43.4404539Z E -45793280
Got it for unrelated PR https://github.com/neondatabase/neon/pull/2210
@aome510 could you please take a look?
Got it again https://github.com/neondatabase/neon/runs/7723174729?check_suite_focus=true
I failed to replicate the issue both in #2235 and locally. Instead, I got several other flaky tests:
test_tenant_config: https://github.com/neondatabase/neon/runs/7740489518?check_suite_focus=true https://github.com/neondatabase/neon/runs/7740120513?check_suite_focus=true
test_pageserver_restart: https://github.com/neondatabase/neon/runs/7740425335?check_suite_focus=true https://github.com/neondatabase/neon/runs/7740118966?check_suite_focus=true
test_tenants_many[RemoteStorageKind.REAL_S3]: https://github.com/neondatabase/neon/runs/7740424684?check_suite_focus=true https://github.com/neondatabase/neon/runs/7740121190?check_suite_focus=true https://github.com/neondatabase/neon/runs/7739496537?check_suite_focus=true https://github.com/neondatabase/neon/runs/7758820107?check_suite_focus=true
test_race_conditions: https://github.com/neondatabase/neon/runs/7739496574?check_suite_focus=true https://github.com/neondatabase/neon/runs/7758969477?check_suite_focus=true
Update: Found one for test_timeline_physical_size_post_compaction: https://github.com/neondatabase/neon/runs/7758774331?check_suite_focus=true
I got another failure like this: https://github.com/neondatabase/neon/actions/runs/3269326640/jobs/5377192508
test_runner/regress/test_timeline_size.py:286: in test_timeline_physical_size_post_compaction
assert_physical_size(env, env.initial_tenant, new_timeline_id)
test_runner/regress/test_timeline_size.py:451: in assert_physical_size
assert res["current_physical_size"] == res["current_physical_size_non_incremental"]
E assert 50814976 == 50962432
E +50814976
E -50962432
Looking at the test, I don't think the problem was fully fixed. The test does this:
pg.safe_psql_many(
    [
        "CREATE TABLE foo (t text)",
        """INSERT INTO foo
           SELECT 'long string to consume some space' || g
           FROM generate_series(1, 100000) g""",
    ]
)
wait_for_last_flush_lsn(env, pg, env.initial_tenant, new_timeline_id)
pageserver_http.timeline_checkpoint(env.initial_tenant, new_timeline_id)
pageserver_http.timeline_compact(env.initial_tenant, new_timeline_id)
assert_physical_size(env, env.initial_tenant, new_timeline_id)
The assumption here is that no new layers are created while assert_physical_size
is fetching the incremental and non-incremental sizes. That's not true, though: PostgreSQL is still running, and autovacuum can kick in at any time and write WAL; once enough WAL is received, a new layer is created. I was able to reproduce this by adding a delay between fetching the incremental and non-incremental physical sizes:
diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs
index 91a385bf..1c8f5d70 100644
--- a/pageserver/src/http/routes.rs
+++ b/pageserver/src/http/routes.rs
@@ -129,6 +129,9 @@ async fn build_timeline_info(
}
};
let current_physical_size = Some(timeline.get_physical_size());
+ if include_non_incremental_physical_size {
+ std::thread::sleep(std::time::Duration::from_millis(30000));
+ }
let info = TimelineInfo {
tenant_id: timeline.tenant_id,
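One way the test could tolerate that race is to retry the comparison until the two readings agree, on the reasoning that a layer flushed between the two reads makes one snapshot disagree, but a quiescent timeline converges. This is only a sketch, not the actual fix; `get_sizes` is a hypothetical callable that would wrap the two size fields from the timeline detail endpoint:

```python
import time

def assert_sizes_eventually_match(get_sizes, attempts=5, delay=1.0):
    # get_sizes() is assumed to return a tuple of
    # (current_physical_size, current_physical_size_non_incremental).
    # Retry: a new layer landing between the two reads can make a single
    # snapshot inconsistent, but repeated reads should converge once the
    # timeline is quiet.
    last = None
    for _ in range(attempts):
        incremental, non_incremental = get_sizes()
        if incremental == non_incremental:
            return incremental
        last = (incremental, non_incremental)
        time.sleep(delay)
    raise AssertionError(f"physical sizes never converged: {last}")
```

An alternative design would be to remove the source of background WAL instead of retrying, e.g. creating the table with `WITH (autovacuum_enabled = off)` so autovacuum never touches it during the measurement window.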