John Spray

Results 144 issues of John Spray

Non-exhaustive list of cases to handle:
- Concurrent requests to another endpoint (e.g. delete the tenant while splitting) or the same endpoint (e.g. retries) should be excluded.
- Crash during...

a/tech_debt
c/storage

Rare: ``` AssertionError: assert not [ (762, '2024-02-23T01:54:42.414003Z WARN delete_timeline{tenant_id=faa1e715b82ea028c2ab77c827a4e253 shard_id=0000 timeline_id=f8b5e0a4e8c75657837989e9d700addb}: got not found err while removing timeline dir, proceeding anyway timeline_dir="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb" path="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000171F4C1-000000000172BF51"\n'), (763, '2024-02-23T01:54:42.436209Z WARN delete_timeline{tenant_id=faa1e715b82ea028c2ab77c827a4e253 shard_id=0000 timeline_id=f8b5e0a4e8c75657837989e9d700addb}:...

c/storage/pageserver
a/test

## Problem

When a tenant creates a new timeline that they will treat as their 'main' history, it is awkward to permanently retain an 'old main' timeline as its ancestor....

c/storage/pageserver
t/tech_design_rfc

## Motivation

Enable deploying pageserver sharding into production. Develop the code from https://github.com/neondatabase/neon/pull/6251 into a service we can deploy.

## DoD

## Implementation ideas ```[tasklist] ### Tasks to be able...

t/feature
t/Epic
c/storage

If we lose the storage controller database, we should be able to recover: all the tenant data is still present in S3. We would have some time: pageserver emergency...

t/feature
c/storage/controller

## Background

The gc_feedback mechanism removed in https://github.com/neondatabase/neon/pull/6863 was meant to protect against edge cases where repeated keyspace repartitioning can result in stacks of deltas that are never fully covered...

Sketch of implementation:
1. Extend PageserverFeedback to include shard number & count
2. Update safekeeper structures that store a remote_consistent_lsn to have some type that stores a mapping of shard...
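
As a rough illustration of step 2, here is a minimal Python sketch of a structure that keeps one remote_consistent_lsn per shard. The class name `ShardedRemoteConsistentLsn`, its methods, and the choice to aggregate with a minimum across shards are all assumptions made for illustration; the issue text is truncated and does not say how the mapping is consumed. The sketch also assumes `shard_count >= 1`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical key type: (shard_number, shard_count), as carried in the feedback.
ShardKey = Tuple[int, int]


@dataclass
class ShardedRemoteConsistentLsn:
    """Hypothetical per-shard store of remote_consistent_lsn values."""

    by_shard: Dict[ShardKey, int] = field(default_factory=dict)

    def update(self, shard_number: int, shard_count: int, lsn: int) -> None:
        key = (shard_number, shard_count)
        # Never move a shard's LSN backwards on stale or reordered feedback.
        self.by_shard[key] = max(self.by_shard.get(key, 0), lsn)

    def min_across_shards(self, shard_count: int) -> Optional[int]:
        """Lowest LSN over all shards of this count (assumed aggregate).

        Returns None until every shard has reported at least once.
        """
        lsns = [self.by_shard.get((n, shard_count)) for n in range(shard_count)]
        if any(lsn is None for lsn in lsns):
            return None
        return min(lsns)
```

Taking the minimum is the conservative choice under this assumption: nothing is treated as remotely consistent until every shard has confirmed it.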

t/feature
c/storage/safekeeper

See RFC #6358

Two recovery paths are needed:
- On startup, when we see that some tenant shards have a splitting state
- During runtime, when something inside the tenant_shard_split...

t/feature
c/storage/controller

Currently our reconciliation loop has the minimum required behavior: it will try to reconcile, and if a reconciliation fails, it will eventually try again (via the background reconciliation task). For...

t/feature
c/storage/controller

We need a piece of code that sends requests to pageservers in the background to get their latest utilization and implicitly check that they're alive. Later, we may also use...
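
One possible shape for such a background check, sketched in Python with asyncio. The function `poll_once`, the `fetch` callback, and the timeout value are hypothetical; the real pageserver utilization API is not described in this snippet, so the caller supplies the request function. A node that errors or times out is recorded as `None`, which doubles as the implicit liveness signal.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable, Optional


async def poll_once(
    nodes: Iterable[str],
    fetch: Callable[[str], Awaitable[Any]],  # hypothetical per-node request
    timeout_s: float = 1.0,
) -> Dict[str, Optional[Any]]:
    """Query every node concurrently; map each node to its result or None."""

    async def one(node: str):
        try:
            return node, await asyncio.wait_for(fetch(node), timeout_s)
        except Exception:
            # Timeouts and request errors both mean "no answer this round".
            return node, None

    results = await asyncio.gather(*(one(n) for n in nodes))
    return dict(results)
```

A long-running task could call `poll_once` in a loop with a sleep between rounds and mark a node offline only after several consecutive `None` results, so that a single dropped request does not flip its state.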

t/feature
c/storage/controller