John Spray
Non-exhaustive list of cases to handle:
- Concurrent requests to another endpoint (e.g. delete the tenant while splitting) or the same endpoint (e.g. retries) should be excluded.
- Crash during...
Rare: ``` AssertionError: assert not [ (762, '2024-02-23T01:54:42.414003Z WARN delete_timeline{tenant_id=faa1e715b82ea028c2ab77c827a4e253 shard_id=0000 timeline_id=f8b5e0a4e8c75657837989e9d700addb}: got not found err while removing timeline dir, proceeding anyway timeline_dir="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb" path="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000171F4C1-000000000172BF51"\n'), (763, '2024-02-23T01:54:42.436209Z WARN delete_timeline{tenant_id=faa1e715b82ea028c2ab77c827a4e253 shard_id=0000 timeline_id=f8b5e0a4e8c75657837989e9d700addb}:...
## Problem

When a tenant creates a new timeline that they will treat as their 'main' history, it is awkward to permanently retain an 'old main' timeline as its ancestor....
## Motivation

Enable deploying pageserver sharding into production. Develop the code from https://github.com/neondatabase/neon/pull/6251 into a service we can deploy.

## DoD

## Implementation ideas ```[tasklist] ### Tasks to be able...
If we lost the storage controller database, then we should be able to recover: all the tenant data is still present in S3. We would have some time: pageserver emergency...
## Background

The gc_feedback mechanism removed in https://github.com/neondatabase/neon/pull/6863 is meant to protect against edge cases where repeated keyspace repartitioning can result in stacks of deltas that are never fully covered...
Sketch of implementation:
1. Extend PageserverFeedback to include shard number & count
2. Update safekeeper structures that store a remote_consistent_lsn to have some type that stores a mapping of shard...
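
The two steps above can be illustrated roughly as follows. This is a minimal sketch, not the safekeeper's actual code: the names `ShardFeedback` and `RemoteConsistentLsnMap` are hypothetical, shard counts are assumed to be at least 1, and taking the minimum across shards as the timeline-wide value is an assumption made here for illustration (the issue text does not say how the per-shard values are combined).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ShardFeedback:
    """Hypothetical feedback message extended with shard number and count."""
    shard_number: int
    shard_count: int
    remote_consistent_lsn: int


@dataclass
class RemoteConsistentLsnMap:
    """Hypothetical per-shard store replacing a single remote_consistent_lsn."""
    by_shard: Dict[Tuple[int, int], int] = field(default_factory=dict)

    def update(self, fb: ShardFeedback) -> None:
        key = (fb.shard_number, fb.shard_count)
        # Never move a shard's LSN backwards on a stale or reordered message.
        self.by_shard[key] = max(self.by_shard.get(key, 0), fb.remote_consistent_lsn)

    def aggregate(self, shard_count: int) -> Optional[int]:
        """Assumed policy: minimum over all shards, None until every shard reported."""
        lsns = [lsn for (_, count), lsn in self.by_shard.items() if count == shard_count]
        if not lsns or len(lsns) < shard_count:
            return None
        return min(lsns)
```

Keying on both shard number and count keeps feedback from before and after a shard split from being mixed together, which is one possible reason to carry both fields.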
See RFC #6358

Two recovery paths are needed:
- On startup, when we see that some tenant shards have a splitting state
- During runtime, when something inside the tenant_shard_split...
Currently our reconciliation loop has the minimum required behavior: it will try to reconcile, and if a reconciliation fails, it will eventually try again (via the background reconciliation task). For...
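
The minimum behavior described here (try to reconcile, and leave failures for a later background pass) could look roughly like the sketch below. The `Reconciler` class, the `dirty` set, and the `reconcile_one` callback are all hypothetical names for illustration and do not reflect the storage controller's real types.

```python
from typing import Callable, Dict, Set


class Reconciler:
    """Hypothetical stand-in for a reconcile-and-retry-later loop."""

    def __init__(self, reconcile_one: Callable[[str], None]) -> None:
        self._reconcile_one = reconcile_one
        # Shards whose observed state may differ from their intended state.
        self.dirty: Set[str] = set()

    def mark_dirty(self, shard_id: str) -> None:
        self.dirty.add(shard_id)

    def background_pass(self) -> Dict[str, str]:
        """Try every dirty shard once; failed shards stay dirty for the next pass."""
        errors: Dict[str, str] = {}
        # sorted() copies the set, so removing entries while looping is safe.
        for shard_id in sorted(self.dirty):
            try:
                self._reconcile_one(shard_id)
            except Exception as exc:
                errors[shard_id] = str(exc)
            else:
                self.dirty.discard(shard_id)
        return errors
```

A caller would invoke `background_pass()` on a timer; a shard that failed in one pass is simply picked up again in the next, which is the "eventually try again" behavior.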
We need a piece of code that sends requests to pageservers in the background to get their latest utilization and implicitly check that they're alive. Later, we may also use...
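
One pass of such a background check might be sketched as follows. Everything here is an assumption for illustration: `NodeStatus`, `heartbeat_pass`, and the injected `fetch_utilization` callable are hypothetical, and no real pageserver endpoint or response format is implied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class NodeStatus:
    """Hypothetical result of one heartbeat to one pageserver."""
    available: bool
    utilization: Optional[dict]


def heartbeat_pass(
    nodes: Dict[str, str],
    fetch_utilization: Callable[[str], dict],
) -> Dict[str, NodeStatus]:
    """Ask every node for its utilization; a failed request marks it unavailable."""
    results: Dict[str, NodeStatus] = {}
    for node_id, base_url in nodes.items():
        try:
            # A successful response doubles as the liveness check.
            results[node_id] = NodeStatus(True, fetch_utilization(base_url))
        except Exception:
            results[node_id] = NodeStatus(False, None)
    return results
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP client; a real service would run this pass periodically in a background task.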