timeline_delete failed in CI
https://github.com/neondatabase/neon/actions/runs/3050749282/jobs/4918341001
2022-09-14T07:26:53.0771396Z =================================== FAILURES ===================================
2022-09-14T07:26:53.0771792Z ______________________________ test_tenant_tasks _______________________________
2022-09-14T07:26:53.0772619Z [gw3] linux -- Python 3.9.2 /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/bin/python
2022-09-14T07:26:53.0773143Z test_runner/fixtures/neon_fixtures.py:956: in verbose_error
2022-09-14T07:26:53.0773552Z res.raise_for_status()
2022-09-14T07:26:53.0774239Z /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/requests/models.py:1021: in raise_for_status
2022-09-14T07:26:53.0774735Z raise HTTPError(http_error_msg, response=self)
2022-09-14T07:26:53.0775482Z E requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:18136/v1/tenant/43c00ec5d0cc7db1917fffd5005fab56/timeline/d1120173207bd7ad7c1f5d88f0677b3d
2022-09-14T07:26:53.0775966Z
2022-09-14T07:26:53.0776217Z The above exception was the direct cause of the following exception:
2022-09-14T07:26:53.0776637Z test_runner/regress/test_tenant_tasks.py:60: in test_tenant_tasks
2022-09-14T07:26:53.0777068Z delete_all_timelines(tenant_id)
2022-09-14T07:26:53.0777530Z test_runner/regress/test_tenant_tasks.py:41: in delete_all_timelines
2022-09-14T07:26:53.0777870Z client.timeline_delete(tenant, t)
2022-09-14T07:26:53.0778242Z test_runner/fixtures/neon_fixtures.py:1059: in timeline_delete
2022-09-14T07:26:53.0778591Z self.verbose_error(res)
2022-09-14T07:26:53.0778907Z test_runner/fixtures/neon_fixtures.py:962: in verbose_error
2022-09-14T07:26:53.0779312Z raise NeonPageserverApiException(msg) from e
2022-09-14T07:26:53.0779955Z E fixtures.neon_fixtures.NeonPageserverApiException: cannot lock compaction critical section try_lock failed because the operation would block
The "wait for all timeline tasks to finish" logic is a bit broken with GC and compaction. We have this:
pub async fn delete_timeline(tenant_id: TenantId, timeline_id: TimelineId) -> anyhow::Result<()> {
    // Start with the shutdown of timeline tasks (this shuts down the walreceiver)
    // It is important that we do not take locks here, and do not check whether the timeline exists,
    // because if we hold tenants_state::write_tenants() while awaiting for the tasks to join
    // we cannot create new timelines and tenants, and that can take quite some time,
    // it can even become stuck due to a bug making the whole pageserver unavailable for some operations
    // so this is the way we deal with concurrent delete requests: shut down everything, wait for confirmation
    // and then try to actually remove the timeline from the in-memory state; this is the point when concurrent requests
    // will synchronize and either fail with the not found error or succeed
    debug!("waiting for wal receiver to shutdown");
    task_mgr::shutdown_tasks(
        Some(TaskKind::WalReceiverManager),
        Some(tenant_id),
        Some(timeline_id),
    )
    .await;
    debug!("wal receiver shutdown confirmed");

    info!("waiting for timeline tasks to shutdown");
    task_mgr::shutdown_tasks(None, Some(tenant_id), Some(timeline_id)).await;
    info!("timeline task shutdown completed");

    match get_tenant(tenant_id, true) {
        Ok(tenant) => {
            tenant.delete_timeline(timeline_id)?;
            if tenant.list_timelines().is_empty() {
                tenant.activate(false);
            }
        }
        Err(e) => anyhow::bail!("Cannot access tenant {tenant_id} in local tenant state: {e:?}"),
    }

    Ok(())
}
That's supposed to kill and wait for all tasks operating on the timeline. However, GC and compaction tasks are only associated with the tenant, not a particular timeline, so the above code won't shut them down.
It's expected behavior: "cannot lock compaction critical section try_lock failed because the operation would block"
Delete waits for GC and compaction to go to sleep by trying to lock a mutex. It does so to avoid corrupting their state by deleting files when that is not expected to happen. If GC or compaction is currently running, the request returns an error and is supposed to be retried. This is not implemented in our test client, but it is implemented in the console. I'll add more comments to explain it better, because the error message is indeed quite puzzling.
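For illustration, here is a minimal sketch of the retry loop the test client is expected to perform. It is written against the fixture names visible in the traceback above (client.timeline_delete and NeonPageserverApiException); the helper name, attempt count, and sleep interval are made up for the example:

    import time

    from fixtures.neon_fixtures import NeonPageserverApiException

    def timeline_delete_with_retries(client, tenant, timeline, attempts=10, delay=0.5):
        # While GC or compaction holds the lock, the pageserver answers with a 500
        # "cannot lock compaction critical section ..." error; retry until the
        # delete succeeds or the attempts are exhausted.
        for attempt in range(attempts):
            try:
                client.timeline_delete(tenant, timeline)
                return
            except NeonPageserverApiException as e:
                # Re-raise anything that is not the expected "busy" error,
                # and give up on the last attempt.
                if "cannot lock compaction critical section" not in str(e) or attempt == attempts - 1:
                    raise
                time.sleep(delay)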
Got what appears to be the same failure in test_tenant_tasks. https://github.com/neondatabase/neon/actions/runs/3345492496/jobs/5541326216
Hit this locally with a test I am trying to create for #2714, as I restart the pageserver to delete timelines. I assume there's a startup-time compaction running, so a retry loop should work for my case.
More flakiness: https://github.com/neondatabase/neon/actions/runs/3500291531/jobs/5863508678
____________________ ERROR at teardown of test_tenant_tasks ____________________
[gw0] linux -- Python 3.9.2 /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/bin/python
/github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/allure_commons/_allure.py:200: in __call__
return self._fixture_function(*args, **kwargs)
test_runner/fixtures/neon_fixtures.py:1043: in neon_env_builder
yield builder
test_runner/fixtures/neon_fixtures.py:785: in __exit__
self.env.pageserver.assert_no_errors()
test_runner/fixtures/neon_fixtures.py:1829: in assert_no_errors
assert not errors
E AssertionError: assert not ['2022-11-18T22:31:01.667460Z ERROR Error processing HTTP request: InternalServerError(cannot lock compaction critical section try_lock failed because the operation would block\n']
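For the teardown assertion specifically, the usual pattern in the regression tests is to allow-list an expected error line before it is hit. A sketch, under the assumption that the NeonPageserver fixture exposes an allowed_errors list that assert_no_errors() consults (verify against neon_fixtures.py):

    # Hypothetical test setup: mark the "cannot lock compaction critical section"
    # 500 as an expected log line so assert_no_errors() does not fail the teardown.
    # `neon_env_builder` and `init_start()` are the fixture entry points used by
    # the regression tests; `allowed_errors` is assumed here.
    env = neon_env_builder.init_start()
    env.pageserver.allowed_errors.append(
        ".*cannot lock compaction critical section.*"
    )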
I believe this is fixed. Checked against the flaky tests database for the last 60 days to confirm.