neon
safekeeper: efficient shard catchup for WAL cursor fan-out
After #9337, when shards restart and need to catch up on old WAL, each shard pulls the WAL records from S3 and filters them independently. This results in O(catchup_ranges) work. Since we expect many shards to require catchup at roughly the same time, we should do this work once and share the result across shards.
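As a rough illustration of "read once, filter per shard", here is a minimal sketch. It is not the design from the pending RFC: `WalRecord`, `ShardCursor`, `owner` and `fan_out` are hypothetical names, and the `key % shard_count` ownership rule is a stand-in for the real key-to-shard mapping.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified WAL record: an LSN plus the key it touches.
#[derive(Clone, Debug, PartialEq)]
pub struct WalRecord {
    pub lsn: u64,
    pub key: u64,
}

/// Hypothetical per-shard cursor: the shard wants records with lsn >= start_lsn.
pub struct ShardCursor {
    pub shard: u32,
    pub start_lsn: u64,
}

/// Placeholder ownership rule (an assumption, not the real sharding scheme).
/// `shard_count` must be non-zero.
fn owner(key: u64, shard_count: u32) -> u32 {
    (key % shard_count as u64) as u32
}

/// Walk an already-fetched WAL range once and route each record to the
/// catching-up shard that owns it, instead of every shard re-reading the range.
pub fn fan_out(
    records: &[WalRecord],
    cursors: &[ShardCursor],
    shard_count: u32,
) -> HashMap<u32, Vec<WalRecord>> {
    let start_by_shard: HashMap<u32, u64> =
        cursors.iter().map(|c| (c.shard, c.start_lsn)).collect();
    // Every catching-up shard gets an entry, even if no record matches.
    let mut out: HashMap<u32, Vec<WalRecord>> =
        cursors.iter().map(|c| (c.shard, Vec::new())).collect();

    for rec in records {
        let shard = owner(rec.key, shard_count);
        // Records owned by shards that are not catching up are skipped.
        if let Some(&start) = start_by_shard.get(&shard) {
            if rec.lsn >= start {
                if let Some(batch) = out.get_mut(&shard) {
                    batch.push(rec.clone());
                }
            }
        }
    }
    out
}
```

With this shape, the S3 read and decode cost is paid once per WAL range, and only the cheap per-record routing scales with the number of lagging shards.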
TODO: details post-RFC.
Consider gossiping timeline progress between safekeepers to know how many shards are offline/lagging.
Consider memory budgeting. Simple approach: estimate timeline catchup volume from LSNs, acquire from semaphore, block when unavailable. Consider QoS to prioritize "important" tenants.
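The simple approach above could look roughly like the following sketch, using only the standard library. `CatchupBudget` and `Permit` are hypothetical names, the estimate assumes LSN distance approximates the bytes to be read, and a real implementation would likely use an async semaphore rather than blocking a thread. QoS prioritization is not modeled.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical byte budget shared by all catchup readers on one safekeeper.
pub struct CatchupBudget {
    available: Mutex<u64>,
    freed: Condvar,
    capacity: u64,
}

/// Returns its bytes to the budget when dropped.
pub struct Permit<'a> {
    budget: &'a CatchupBudget,
    bytes: u64,
}

impl CatchupBudget {
    pub fn new(capacity: u64) -> Self {
        Self { available: Mutex::new(capacity), freed: Condvar::new(), capacity }
    }

    /// Estimate catchup volume as the LSN distance (assumption: LSNs are
    /// byte positions in the WAL, so the difference approximates bytes read).
    pub fn estimate(start_lsn: u64, end_lsn: u64) -> u64 {
        end_lsn.saturating_sub(start_lsn)
    }

    /// Block until `bytes` fit in the budget. Requests are clamped to the
    /// total capacity so a single oversized catchup cannot wait forever.
    pub fn acquire(&self, bytes: u64) -> Permit<'_> {
        let bytes = bytes.min(self.capacity);
        let mut avail = self.available.lock().unwrap();
        while *avail < bytes {
            avail = self.freed.wait(avail).unwrap();
        }
        *avail -= bytes;
        Permit { budget: self, bytes }
    }

    /// Non-blocking variant: returns None when the budget is exhausted.
    pub fn try_acquire(&self, bytes: u64) -> Option<Permit<'_>> {
        let bytes = bytes.min(self.capacity);
        let mut avail = self.available.lock().unwrap();
        if *avail < bytes {
            return None;
        }
        *avail -= bytes;
        Some(Permit { budget: self, bytes })
    }

    pub fn available(&self) -> u64 {
        *self.available.lock().unwrap()
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        // Release the bytes, then wake any waiters so they can re-check.
        *self.budget.available.lock().unwrap() += self.bytes;
        self.budget.freed.notify_all();
    }
}
```

A caller would compute `CatchupBudget::estimate(start_lsn, end_lsn)`, hold the returned `Permit` for the duration of the catchup read, and let it drop afterwards to release the memory.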