docs: sharding phase 1 RFC
We need to shard our Tenants to support larger databases without those large databases dominating our pageservers and/or requiring dedicated pageservers.
This RFC aims to define an initial capability that will permit creating large-capacity databases using a static configuration defined at time of Tenant creation.
Online re-sharding is deferred as future work, as is offloading layers for historical reads. However, both of these capabilities would be implementable without further changes to the control plane or compute: this RFC aims to define the cross-component work needed to bootstrap sharding end-to-end.
I also wonder if the initial idea with the safekeepers is actually better:
- For starters we can just send everything to all shards. That will make our underutilized network interfaces somewhat less underutilized, but I don't think it would be a big deal short-term.
- That still leaves us with the problem of one shard holding old WAL and not receiving new updates. But here I think we overstate the importance of that problem; we can allow such WAL segments to be offloaded to S3 and retrieve them on demand later.
- The SK team was already considering switching the current stream-per-connection model to something multiplexed, with messages sent over a shared wire; gRPC looked like the most actionable option there. So we want to change the protocol anyway.
- I'm also not concerned about WAL decoding performance -- we only need pagerefs, so we could do stream parsing with minimal state (3-4 variables). Given that the whole stream is written to disk anyway, I'm pretty sure such parsing would not be a bottleneck (see the sketch below).
So I'd rather start with "all WAL to all shards" and then work through that list asynchronously, without blocking initial sharding.
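To make the "filter on pagerefs" idea concrete, here is a minimal sketch of what per-shard filtering could look like if every shard receives the full stream. The types and the `owns_block` placement check are hypothetical stand-ins, not the actual pageserver code:

```rust
/// Hypothetical decoded record: for filtering we only need the pagerefs.
struct DecodedRecord {
    lsn: u64,
    /// (relation id, block number) pairs referenced by the record (simplified).
    blocks: Vec<(u32, u32)>,
}

/// Placeholder for whatever shard-placement function the RFC settles on,
/// e.g. a hash of (relation, block stripe) modulo the shard count.
fn owns_block(rel: u32, blkno: u32) -> bool {
    let _ = (rel, blkno);
    true // stub: every shard answers "yes" in this sketch
}

/// Ingest only the blocks this shard owns; records with no owned blocks
/// merely advance the shard's last-received LSN.
fn ingest_record(rec: &DecodedRecord, advance_lsn: &mut u64) {
    for &(rel, blkno) in &rec.blocks {
        if owns_block(rel, blkno) {
            // put (rel, blkno) @ rec.lsn -> record body into the KV store
        }
    }
    *advance_lsn = rec.lsn;
}
```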
I am still not quite sure whether scattering WAL is a good idea. The drawbacks are clear - you explained why they may not be so critical. Maybe... But what are the advantages? Reducing network traffic? Is it a bottleneck?
Also, as I already wrote, in principle when we are sending the same stream to multiple consumers we can use low-level broadcast, which seems much more efficient than scattering. I'm not sure it is applicable to safekeepers<->PS communication (at least because pageservers are not synced and may be at different positions in the stream).
Concerning decoding - there are some WAL records which modify pages not listed in their pagerefs, e.g. the *ALL_VISIBLE_CLEARED flags in heap INSERT/UPDATE/DELETE records. This actually raises a more serious problem than just WAL scattering. The current sharding scheme assumes that we ignore forknum but use blocknum. But related pages in the main and VM forks definitely have different block numbers, so they can be assigned to different shards. Therefore walingest should take this into account and not try to insert into the KV storage changes that are not relevant to this shard.
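To illustrate the point, here is a small sketch assuming a stripe-based placement of (relation, block number) onto shards (the hash and stripe size are assumptions for illustration, not the RFC's final scheme). Because the fork number is ignored, a heap page and the VM page that covers it map to effectively unrelated shards:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative placement: group blocks into stripes and hash each stripe onto
/// a shard. Fork number is deliberately not part of the key, as in the RFC.
fn key_to_shard(rel: u32, blkno: u32, shard_count: u32, stripe_size: u32) -> u32 {
    let stripe = blkno / stripe_size;
    let mut h = DefaultHasher::new();
    (rel, stripe).hash(&mut h);
    (h.finish() % shard_count as u64) as u32
}

fn main() {
    let (rel, shards, stripe_size) = (16384, 8, 32768);
    let heap_blkno = 1_000_000;
    // One VM page covers ~32672 heap pages (8 KB pages, 2 bits per heap block),
    // so the VM block number is much smaller than the heap block number.
    let vm_blkno = heap_blkno / 32672;
    println!(
        "heap block {} -> shard {}, covering VM block {} -> shard {}",
        heap_blkno,
        key_to_shard(rel, heap_blkno, shards, stripe_size),
        vm_blkno,
        key_to_shard(rel, vm_blkno, shards, stripe_size)
    );
}
```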
Sorry if I missed something, but I failed to find an answer to one question related to key sharding in this RFC: how are we going to handle relation size? Right now relation size is maintained in a separate key-value pair in the KV storage. It is not explicitly set by the SMGR; it is updated on the PS whenever a page image or WAL record is appended to the relation.
So to which shard should compute send a get_rel_size request? It looks like the only working solution is to broadcast this request to all shards and then take the maximum, but that seems quite inefficient... Certainly we have a relsize cache in compute, so relation size is not expected to be retrieved very frequently. But relation size is requested at node startup, and one of the primary goals of sharding was to reduce startup time; broadcasting get_rel_size requests to all shards will definitely not speed that up.
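For reference, the broadcast-and-max approach described above would look roughly like this sketch (`get_rel_size_from_shard` is a hypothetical stand-in for the per-shard page_service request):

```rust
/// Hypothetical per-shard request; in reality this would go over the
/// pageserver's page_service protocol.
fn get_rel_size_from_shard(shard: u32, rel: u32, lsn: u64) -> u32 {
    let _ = (shard, rel, lsn);
    0 // stub
}

/// Each shard only sees writes to the blocks it owns, so only the shard that
/// holds the relation's last block knows the true size; taking the maximum
/// over all shards' answers recovers it.
fn get_rel_size(shard_count: u32, rel: u32, lsn: u64) -> u32 {
    (0..shard_count)
        .map(|shard| get_rel_size_from_shard(shard, rel, lsn))
        .max()
        .unwrap_or(0)
}
```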
One more key sharding issue not covered by this RFC: the VM fork is updated either by special WAL records or as part of heap insert/update/delete operations (when the corresponding bit is set). The problem is that the block number of the updated VM fork page is very different from the main-fork block number specified in the heap_* record. So even if we are broadcasting the same WAL stream to all shards, we still want to apply only those WAL records which are associated with the particular shard.
But for FSM/VM fork updates this is not trivial: their buffer tags are not specified in the record's block data. We have to decode the WAL record, check the bit (whether it changes page visibility), then calculate the position of the VM page and check whether it is assigned to this shard, and if so, update it.
The task becomes more challenging if we are going to scatter WAL: in that case we have to look for these VM flags and broadcast such records.
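As a rough sketch of the check described above (the constant and helper names are assumptions, not actual walingest code): for a heap record that clears the all-visible bit, the shard computes which VM block covers the modified heap block and applies the VM change only if it owns that block:

```rust
/// Heap blocks covered by one visibility-map page: with 8 KB pages and 2 bits
/// per heap block this is (8192 - 24) * 4 = 32672 (assumed here for illustration).
const HEAP_BLOCKS_PER_VM_PAGE: u32 = 32672;

/// Placeholder for the same shard-placement function used for the main fork.
fn owns_block(rel: u32, blkno: u32) -> bool {
    let _ = (rel, blkno);
    true // stub
}

/// Apply the VM side effect of a heap insert/update/delete record only if this
/// shard owns the VM page that covers the modified heap block.
fn maybe_apply_vm_clear(rel: u32, heap_blkno: u32, all_visible_cleared: bool) {
    if !all_visible_cleared {
        return;
    }
    let vm_blkno = heap_blkno / HEAP_BLOCKS_PER_VM_PAGE;
    if owns_block(rel, vm_blkno) {
        // clear the bits for heap_blkno in the VM page image stored on this shard
    }
}
```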
I've cleaned this up to cover what we did in Q4 -- will publish a smaller follow-on RFC that describes shard splitting (Q1).