safekeeper: efficient shard catchup for WAL cursor fan-out

Open erikgrinaker opened this issue 1 year ago • 0 comments

After #9337, when shards restart and need to catch up on old WAL, each shard will pull WAL records from S3 and filter them. This results in O(catchup_ranges) work. We should do this work once across multiple shards, since we expect many shards to require catchup at roughly the same time.
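As a rough illustration of "do this work once", the sketch below reads a batch of WAL records a single time and routes each record to the shard that still needs it, instead of every shard scanning the same range. Everything in it is hypothetical and not taken from the Neon codebase: `WalRecord`, `CatchupRequest`, `fan_out`, and especially `shard_for_key`, which stands in for the real sharding scheme with a plain modulo. It also assumes each record belongs to exactly one shard and that `shard_count` is greater than zero.

```rust
use std::collections::BTreeMap;

/// Hypothetical WAL record: an LSN plus the key it modifies.
#[derive(Clone, Debug, PartialEq)]
struct WalRecord {
    lsn: u64,
    key: u64,
}

/// Hypothetical catchup request: the shard needs every record at or after `start_lsn`.
struct CatchupRequest {
    shard: u32,
    start_lsn: u64,
}

/// Placeholder shard mapping (assumption): key modulo shard count.
/// Panics if `shard_count` is zero.
fn shard_for_key(key: u64, shard_count: u32) -> u32 {
    (key % shard_count as u64) as u32
}

/// Scans `records` once and builds one output batch per requesting shard.
/// Records owned by shards that did not request catchup are dropped.
fn fan_out(
    records: &[WalRecord],
    requests: &[CatchupRequest],
    shard_count: u32,
) -> BTreeMap<u32, Vec<WalRecord>> {
    let start_by_shard: BTreeMap<u32, u64> =
        requests.iter().map(|r| (r.shard, r.start_lsn)).collect();
    let mut batches: BTreeMap<u32, Vec<WalRecord>> =
        requests.iter().map(|r| (r.shard, Vec::new())).collect();

    for record in records {
        let shard = shard_for_key(record.key, shard_count);
        if let Some(&start_lsn) = start_by_shard.get(&shard) {
            if record.lsn >= start_lsn {
                // The entry exists because `batches` was built from the same requests.
                batches.get_mut(&shard).unwrap().push(record.clone());
            }
        }
    }
    batches
}
```

With this shape, the cost of reading and decoding the range is paid once per batch, and only the per-record routing depends on how many shards are catching up.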

TODO: details post-RFC.

Consider gossiping timeline progress between safekeepers to know how many shards are offline/lagging.

Consider memory budgeting. Simple approach: estimate timeline catchup volume from LSNs, acquire from semaphore, block when unavailable. Consider QoS to prioritize "important" tenants.
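The simple approach above could look roughly like the following. This is a minimal std-only sketch, not a proposed implementation: `MemoryBudget` and `estimate_catchup_bytes` are hypothetical names, LSNs are treated as plain byte offsets so that the LSN difference serves as the volume estimate, and a blocking `Mutex`/`Condvar` pair stands in for whatever async semaphore the safekeeper would actually use. Requests larger than the whole budget are clamped to the capacity so they cannot block forever. QoS prioritization is not covered.

```rust
use std::sync::{Condvar, Mutex};

/// LSN treated as a byte offset into the WAL (assumption for this sketch).
type Lsn = u64;

/// Estimates catchup volume in bytes as the LSN distance; zero if already caught up.
fn estimate_catchup_bytes(from: Lsn, to: Lsn) -> u64 {
    to.saturating_sub(from)
}

/// Hypothetical byte-weighted semaphore guarding catchup memory.
struct MemoryBudget {
    capacity: u64,
    available: Mutex<u64>,
    released: Condvar,
}

impl MemoryBudget {
    fn new(capacity: u64) -> Self {
        Self {
            capacity,
            available: Mutex::new(capacity),
            released: Condvar::new(),
        }
    }

    /// Blocks until the (clamped) request fits, reserves it, and returns the
    /// number of bytes actually reserved. Pass that value to `release`.
    fn acquire(&self, bytes: u64) -> u64 {
        let bytes = bytes.min(self.capacity);
        let mut available = self.available.lock().unwrap();
        while *available < bytes {
            available = self.released.wait(available).unwrap();
        }
        *available -= bytes;
        bytes
    }

    /// Non-blocking variant: returns `None` if the budget is currently too low.
    fn try_acquire(&self, bytes: u64) -> Option<u64> {
        let bytes = bytes.min(self.capacity);
        let mut available = self.available.lock().unwrap();
        if *available >= bytes {
            *available -= bytes;
            Some(bytes)
        } else {
            None
        }
    }

    /// Returns previously reserved bytes and wakes any blocked waiters.
    fn release(&self, bytes: u64) {
        let mut available = self.available.lock().unwrap();
        *available = (*available + bytes).min(self.capacity);
        drop(available);
        self.released.notify_all();
    }

    /// Bytes currently unreserved.
    fn available(&self) -> u64 {
        *self.available.lock().unwrap()
    }
}
```

A catchup task would call `acquire(estimate_catchup_bytes(shard_lsn, commit_lsn))` before reading and `release` with the returned amount once the batch has been sent. Since the LSN distance is only an estimate, a real implementation might reserve per batch rather than for the whole range.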

erikgrinaker • Oct 09 '24 15:10