timeline_delete failed in CI
https://github.com/neondatabase/neon/actions/runs/3050749282/jobs/4918341001
2022-09-14T07:26:53.0771396Z =================================== FAILURES ===================================
2022-09-14T07:26:53.0771792Z ______________________________ test_tenant_tasks _______________________________
2022-09-14T07:26:53.0772619Z [gw3] linux -- Python 3.9.2 /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/bin/python
2022-09-14T07:26:53.0773143Z test_runner/fixtures/neon_fixtures.py:956: in verbose_error
2022-09-14T07:26:53.0773552Z res.raise_for_status()
2022-09-14T07:26:53.0774239Z /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/requests/models.py:1021: in raise_for_status
2022-09-14T07:26:53.0774735Z raise HTTPError(http_error_msg, response=self)
2022-09-14T07:26:53.0775482Z E requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:18136/v1/tenant/43c00ec5d0cc7db1917fffd5005fab56/timeline/d1120173207bd7ad7c1f5d88f0677b3d
2022-09-14T07:26:53.0775966Z
2022-09-14T07:26:53.0776217Z The above exception was the direct cause of the following exception:
2022-09-14T07:26:53.0776637Z test_runner/regress/test_tenant_tasks.py:60: in test_tenant_tasks
2022-09-14T07:26:53.0777068Z delete_all_timelines(tenant_id)
2022-09-14T07:26:53.0777530Z test_runner/regress/test_tenant_tasks.py:41: in delete_all_timelines
2022-09-14T07:26:53.0777870Z client.timeline_delete(tenant, t)
2022-09-14T07:26:53.0778242Z test_runner/fixtures/neon_fixtures.py:1059: in timeline_delete
2022-09-14T07:26:53.0778591Z self.verbose_error(res)
2022-09-14T07:26:53.0778907Z test_runner/fixtures/neon_fixtures.py:962: in verbose_error
2022-09-14T07:26:53.0779312Z raise NeonPageserverApiException(msg) from e
2022-09-14T07:26:53.0779955Z E fixtures.neon_fixtures.NeonPageserverApiException: cannot lock compaction critical section try_lock failed because the operation would block
The "wait for all timeline tasks to finish" logic is a bit broken with GC and compaction. We have this:
pub async fn delete_timeline(tenant_id: TenantId, timeline_id: TimelineId) -> anyhow::Result<()> {
    // Start with the shutdown of timeline tasks (this shuts down the walreceiver)
    // It is important that we do not take locks here, and do not check whether the timeline exists,
    // because if we hold tenants_state::write_tenants() while awaiting for the tasks to join
    // we cannot create new timelines and tenants, and that can take quite some time,
    // it can even become stuck due to a bug making the whole pageserver unavailable for some operations
    // so this is the way we deal with concurrent delete requests: shut down everything, wait for confirmation
    // and then try to actually remove the timeline from the in-memory state; this is the point when concurrent requests
    // will synchronize and either fail with the not found error or succeed
    debug!("waiting for wal receiver to shutdown");
    task_mgr::shutdown_tasks(
        Some(TaskKind::WalReceiverManager),
        Some(tenant_id),
        Some(timeline_id),
    )
    .await;
    debug!("wal receiver shutdown confirmed");

    info!("waiting for timeline tasks to shutdown");
    task_mgr::shutdown_tasks(None, Some(tenant_id), Some(timeline_id)).await;
    info!("timeline task shutdown completed");

    match get_tenant(tenant_id, true) {
        Ok(tenant) => {
            tenant.delete_timeline(timeline_id)?;
            if tenant.list_timelines().is_empty() {
                tenant.activate(false);
            }
        }
        Err(e) => anyhow::bail!("Cannot access tenant {tenant_id} in local tenant state: {e:?}"),
    }

    Ok(())
}
That's supposed to kill and wait for all tasks operating on the timeline. However, GC and compaction tasks are only associated with the tenant, not a particular timeline, so the above code won't shut them down.
It's expected behavior: "cannot lock compaction critical section try_lock failed because the operation would block"
Delete waits for GC and compaction to go to sleep by trying to lock a mutex. It does so to avoid corrupting their state by deleting files when that is not expected to happen. If GC or compaction is currently running, the request returns an error and is supposed to be retried. This is not implemented in our test client, but it is implemented in the console. I'll add more comments to explain it better, because the error message is indeed quite puzzling.
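For illustration, here is a minimal sketch of the retry loop the test client is expected to perform. It is written against the fixture names visible in the traceback above (client.timeline_delete and NeonPageserverApiException); the helper name, attempt count, and sleep interval are made up for the example:

    import time

    from fixtures.neon_fixtures import NeonPageserverApiException

    def timeline_delete_with_retries(client, tenant, timeline, attempts=10, delay=0.5):
        # While GC or compaction holds the lock, the pageserver answers with a 500
        # "cannot lock compaction critical section ..." error; retry until the
        # delete succeeds or the attempts are exhausted.
        for attempt in range(attempts):
            try:
                client.timeline_delete(tenant, timeline)
                return
            except NeonPageserverApiException as e:
                # Re-raise anything that is not the expected "busy" error,
                # and give up on the last attempt.
                if "cannot lock compaction critical section" not in str(e) or attempt == attempts - 1:
                    raise
                time.sleep(delay)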
Got what appears to be the same failure in test_tenant_tasks. https://github.com/neondatabase/neon/actions/runs/3345492496/jobs/5541326216
Hit this locally with a test I am trying to create for #2714, as I restart the pageserver to delete timelines. I assume there's a startup-time compaction running, so a retry loop should work for my case.
More flakiness: https://github.com/neondatabase/neon/actions/runs/3500291531/jobs/5863508678
____________________ ERROR at teardown of test_tenant_tasks ____________________
[gw0] linux -- Python 3.9.2 /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/bin/python
/github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/allure_commons/_allure.py:200: in __call__
return self._fixture_function(*args, **kwargs)
test_runner/fixtures/neon_fixtures.py:1043: in neon_env_builder
yield builder
test_runner/fixtures/neon_fixtures.py:785: in __exit__
self.env.pageserver.assert_no_errors()
test_runner/fixtures/neon_fixtures.py:1829: in assert_no_errors
assert not errors
E AssertionError: assert not ['2022-11-18T22:31:01.667460Z ERROR Error processing HTTP request: InternalServerError(cannot lock compaction critical section try_lock failed because the operation would block\n']
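For the teardown assertion specifically, the usual pattern in the regression tests is to allow-list an expected error line before it is hit. A sketch, under the assumption that the NeonPageserver fixture exposes an allowed_errors list that assert_no_errors() consults (verify against neon_fixtures.py):

    # Hypothetical test setup: mark the "cannot lock compaction critical section"
    # 500 as an expected log line so assert_no_errors() does not fail the teardown.
    # `neon_env_builder` and `init_start()` are the fixture entry points used by
    # the regression tests; `allowed_errors` is assumed here.
    env = neon_env_builder.init_start()
    env.pageserver.allowed_errors.append(
        ".*cannot lock compaction critical section.*"
    )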
I believe this is fixed. Checked against the flaky tests database for the last 60 days to confirm.