Add availability-recovery from systematic chunks
Don't look at the commit history, it's confusing, as this branch is based on another branch that was merged
Fixes #598. Also implements RFC #47.
Description
- Availability-recovery now first attempts to request the systematic chunks for large PoVs (the first ~n/3 chunks, from which the full data can be recovered without the costly reed-solomon decoding), falling back to recovering from all chunks if that process fails for any reason. Additionally, backers are used as a backup for requesting the systematic chunks if the assigned validator is not offering the chunk (each backer is only used for one systematic chunk, so as not to overload them).
- Quite obviously, recovering from systematic chunks is much faster than recovering from regular chunks (4000% faster, as measured on my Apple M2 Pro).
- Introduces a ValidatorIndex -> ChunkIndex mapping which is different for every core, in order to avoid only querying the first n/3 validators over and over again in the same session. The mapping is the one described in RFC 47 (see the illustrative sketch after this list).
- The mapping is feature-gated by the NodeFeatures runtime API so that it can only be enabled via a governance call once a sufficient majority of validators have upgraded their client. If the feature is not enabled, the mapping will be the identity mapping and backwards-compatibility will be preserved.
- Adds a new chunk request protocol version (v2), which adds the ChunkIndex to the response. This may or may not be checked against the expected chunk index: for av-distribution and systematic recovery it will be checked, but not for regular recovery. This is backwards compatible: a v2 request is attempted first, and if that fails during protocol negotiation, v1 is used.
- Systematic recovery is only attempted during approval-voting, where we have easy access to the core_index. For disputes and collator pov_recovery, regular chunk requests are used, just as before.
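For intuition, here is a rough sketch of the kind of per-core rotation such a mapping performs. This is an illustration only, not the exact RFC 47 formula (consult the RFC for the authoritative definition); the function name and threshold computation are simplified.

```rust
/// Illustrative only: a per-core rotation of the ValidatorIndex -> ChunkIndex
/// assignment, so that the systematic (first ~n/3) chunks land on different
/// validators for different cores. The real mapping is the one defined in RFC 47.
fn chunk_index_for(
    node_features_enabled: bool, // the governance-enabled NodeFeatures bit
    n_validators: u32,
    core_index: u32,
    validator_index: u32,
) -> u32 {
    if !node_features_enabled {
        // Backwards-compatible identity mapping.
        return validator_index;
    }
    // Number of chunks needed for systematic recovery, roughly n/3.
    let threshold = n_validators.saturating_sub(1) / 3 + 1;
    // Shift the assignment by a per-core offset so that the systematic-chunk
    // load rotates across the validator set instead of always hitting the
    // same first ~n/3 validators.
    (validator_index + core_index * threshold) % n_validators
}
```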
Performance results
Some results from subsystem-bench:
with regular chunk recovery: CPU usage per block 39.82s
with recovery from backers: CPU usage per block 16.03s
with systematic recovery: CPU usage per block 19.07s
End-to-end results here: https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099
TODO:
- [x] RFC #47
- [x] merge https://github.com/paritytech/polkadot-sdk/pull/2177 and rebase on top of those changes
- [x] merge https://github.com/paritytech/polkadot-sdk/pull/2771 and rebase
- [x] add tests
- [x] preliminary performance measure on Versi: see https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099
- [x] Rewrite the implementer's guide documentation
- [x] https://github.com/paritytech/polkadot-sdk/pull/3065
- [x] https://github.com/paritytech/zombienet/issues/1705 and fix zombienet tests
- [x] security audit
- [ ] implement https://github.com/paritytech/polkadot-sdk/issues/3127 as a stacked PR
- [ ] final versi test and performance measure
Need to rebase on top of https://github.com/paritytech/polkadot-sdk/pull/1457, it currently contains changes from both PRs
merged latest master into the branch and the diff looks good now.
Note that the commit history will look confusing, and appears to contain commits from both this PR and the previous one (#1457). This is because this branch contains merge commits from two branches and has a complicated history. But the diff is fine so I won't squash and force push it.
In the current state, this PR checks the runtime to see if the shuffling algorithm is enabled, but it's up to the client code to perform the actual shuffling.
I think a better way of doing this would be to embed the shuffling algorithm into the runtime and expose a runtime API for retrieving the mapping.
This way, further changes to the algorithm would be encapsulated into the runtime upgrade and it'd provide an easy way of using the shuffle in alternative clients without reimplementing it. Otherwise, reimplementing the shuffle in other languages may be tricky, cause duplicated effort and be error-prone.
WDYT?
@sandreim suggested opening an RFC for these details as they are quite important. I think I'll work on it
I think a better way of doing this would be to embed the shuffling algorithm into the runtime and expose a runtime API for retrieving the mapping.
What part of the shuffling algorithm are you thinking of embedding into the runtime? If that's the fn get_shuffle that will be called once per block, then it's probably fine. However, I imagine the other part that makes the ParaId-specific shuffle is the one most likely to change with the upcoming CoreJam model. We'd need to base it on CoreId most likely, or something else. So we're back to square one. Although it's a good point about reimplementing get_shuffle in other languages.
OTOH, if we keep it offchain, we can do optimizations like precomputing the shuffling per session once we have the randomness and client features are expected to be on.
@sandreim suggested opening an RFC for these details as they are quite important. I think I'll work on it
SGTM. We should figure out the details of the shuffling before that.
I think a better way of doing this would be to embed the shuffling algorithm into the runtime and expose a runtime API for retrieving the mapping.
What part of the shuffling algorithm are you thinking of embedding into the runtime? If that's the fn get_shuffle that will be called once per block, then it's probably fine. However, I imagine the other part that makes the ParaId-specific shuffle is the one most likely to change with the upcoming CoreJam model. We'd need to base it on CoreId most likely, or something else. So we're back to square one. Although it's a good point about reimplementing get_shuffle in other languages. OTOH, if we keep it offchain, we can do optimizations like precomputing the shuffling per session once we have the randomness and client features are expected to be on.
Keeping it offchain sounds like a good idea to me. For performance considerations, we should keep stuff that can be done offchain out of the runtime, unless there is a good reason not to.
Great work @alindima 🚀
What part of the shuffling algorithm are you thinking of embedding into the runtime?
I think the benefit is highest when we embed all of it into the runtime. By benefit I mean being able to atomically change the underlying shuffling algorithm. We could expose a couple of runtime APIs like:
fn get_availability_chunk_map(block_hash, para_id) -> Vec<ChunkIndex>
// The second vector is used to get the para-specific offset into the shuffle.
fn get_availability_chunk_para_maps(block_hash) -> (Vec<ChunkIndex>, Vec<ChunkIndex>)
Subsystems that need the para-specific map (availability-distribution) would just use the vector of get_availability_chunk_para_maps.
We could cache it in the various subsystems that use it, like we do in the current state of the PR anyway. availability-distribution would call it once per relay chain block. bitfield-signing and availability-recovery would also call it once per relay chain block.
Of course, disputes could also trigger availability-recovery.
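For concreteness, a rough sketch of how the two calls above might be declared (the trait name is hypothetical and the imports are assumed; with sp_api::decl_runtime_apis! the block hash is supplied at the call site rather than as an explicit parameter):

```rust
// Hypothetical sketch, not an existing polkadot-sdk API. Assumes ParaId and
// ChunkIndex are available from the primitives crate.
use polkadot_primitives::{ChunkIndex, Id as ParaId};

sp_api::decl_runtime_apis! {
    pub trait AvailabilityChunkMapping {
        /// Chunk mapping for the given para at the block this is invoked at.
        fn availability_chunk_map(para_id: ParaId) -> Vec<ChunkIndex>;
        /// Base shuffle plus the per-para offsets used to derive it.
        fn availability_chunk_para_maps() -> (Vec<ChunkIndex>, Vec<ChunkIndex>);
    }
}
```

Node-side, a subsystem would then call these once per relay chain block through the runtime API client and cache the result, as described above.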
However, I imagine the other part that makes the ParaId-specific shuffle is the one most likely to change with the upcoming CoreJam model. We'd need to base it on CoreId most likely, or something else.
Indeed this would have to change, as paraIds will no longer exist. I can't tell right now which ID will be the best option, but I think you're right that CoreId is probably the right one once CoreJam lands. Or could it be WorkClass?
I assume CoreJam implementation will add some breaking changes in the consensus protocol that will require all validators to upgrade to latest client version anyway.
Nevertheless, my main argument for moving the shuffling algorithm on-chain is that it'd be easy to modify. Otherwise, we'd again need to coordinate an atomic modification of the client behaviour of all validators. AFAICT, the runtime is the best place for such coordination.
I am still not sure what's the best option right now
This pull request has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/parachain-consensus-updates-coretime-asynchronous-backing-scalability/4396/1
Discovered an issue that needs fixing: pov-recovery in cumulus does not work if the collator uses an RPC relay chain node.
It fails because the partial relay chain node built in cumulus does not have access to the ChainApiSubsystem to query the block number of a relay chain block by its hash.
There are a couple of ways I can fix this:
- pass the block number in as a (maybe optional) param in the RecoverAvailableData payload and don't query the chain api if it's supplied.
- add an RPC method for the chain api's block_number and use it in cumulus.
- make the ChainApiSubsystem generic over a BlockNumberProvider that uses the DB for full nodes and relies on RPC for cumulus partial nodes (see the sketch below).
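A rough sketch of that third option; the trait and struct names are illustrative, not actual polkadot-sdk interfaces:

```rust
// Hypothetical sketch of the third option; not actual polkadot-sdk interfaces.
use polkadot_primitives::{BlockNumber, Hash};

/// Abstracts "relay chain block hash -> block number" so that the
/// ChainApiSubsystem does not need direct database access.
trait BlockNumberProvider {
    fn block_number(&self, hash: Hash) -> Option<BlockNumber>;
}

/// Full nodes: answer from the local backend/database.
struct DbBlockNumberProvider;

impl BlockNumberProvider for DbBlockNumberProvider {
    fn block_number(&self, _hash: Hash) -> Option<BlockNumber> {
        // Would look the header up in the local database here.
        None
    }
}

/// Cumulus minimal relay chain nodes: answer via the relay chain RPC client.
struct RpcBlockNumberProvider;

impl BlockNumberProvider for RpcBlockNumberProvider {
    fn block_number(&self, _hash: Hash) -> Option<BlockNumber> {
        // Would issue a `chain_getHeader`-style RPC call here.
        None
    }
}
```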
I'll punt this for now, as the most important thing that needs consensus at this point is the right shuffling algorithm
This pull request has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/polkadot-da-vs-competition/3403/5
Tested in versi, here are the performance results: https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099
Discovered an issue that needs fixing: pov-recovery in cumulus does not work if the collator uses an RPC relay chain node. It fails because the partial relay chain node built in cumulus does not have access to the ChainApiSubsystem to query the block number of a relay chain block by its hash.
Fixed the cumulus pov_recovery issue
Opened an RFC for discussing the details of the chunk index assignment function: https://github.com/polkadot-fellows/RFCs/pull/47
I had to do a little force push because I messed up a merge commit. That's also why the CI bot requested reviews from a bunch of groups that were not really needed
@ordian @alexggh @sandreim This is ready for another review, I implemented the updated proposal from the RFC. I only have to implement some zombienet tests and adapt subsystem-bench for it, but since it's a large PR, you can get started reviewing it
Some results from subsystem-bench:
with regular chunk recovery: CPU usage per block 39.82s
with recovery from backers: CPU usage per block 16.03s
with systematic recovery: CPU usage per block 19.07s
I just thought of another improvement to this PR (which I'd like to do as a follow-up, because this PR is already huge).
We can cache the req-response protocol version used for each validator, to avoid always trying v2 first for nodes that haven't been upgraded. We then listen for PeerConnected/PeerDisconnected messages from the network bridge and purge the protocol version from the cache for the validator. Also purge the entire cache on session change just to be safe.
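A rough sketch of what such a cache could look like (types and names are hypothetical, not existing polkadot-sdk interfaces; the cache could be keyed by peer id or by validator/authority id, whichever the requester has at hand):

```rust
// Hypothetical sketch of the follow-up idea, not existing polkadot-sdk types.
use std::collections::HashMap;
use std::hash::Hash;

#[derive(Clone, Copy)]
enum ChunkReqProtocol {
    V1,
    V2,
}

struct ProtocolVersionCache<Peer: Hash + Eq> {
    by_peer: HashMap<Peer, ChunkReqProtocol>,
}

impl<Peer: Hash + Eq> ProtocolVersionCache<Peer> {
    fn new() -> Self {
        Self { by_peer: HashMap::new() }
    }

    /// Use the cached version if we have one; otherwise optimistically try v2.
    fn preferred(&self, peer: &Peer) -> ChunkReqProtocol {
        self.by_peer.get(peer).copied().unwrap_or(ChunkReqProtocol::V2)
    }

    /// Remember which version succeeded after protocol negotiation.
    fn record(&mut self, peer: Peer, version: ChunkReqProtocol) {
        self.by_peer.insert(peer, version);
    }

    /// On `PeerDisconnected` from the network bridge: the peer may come back
    /// upgraded, so drop what we knew about it.
    fn on_disconnect(&mut self, peer: &Peer) {
        self.by_peer.remove(peer);
    }

    /// On session change, purge everything just to be safe.
    fn on_session_change(&mut self) {
        self.by_peer.clear();
    }
}
```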
I'll open a separate issue for this
This pull request has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/update-validator-set-size-increase-on-kusama/8218/1
I've conducted the final versi testing before merging this and everything looks good! When enabled, total recovery time is halved and the CPU consumption associated with erasure coding is on average more than halved (from ~13% to ~5%).
I'll do a final pass through it, polish some metrics and logs and merge it soon
The CI pipeline was cancelled due to failure of one of the required jobs. Job name: cargo-clippy Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6283397
We should still explore switching to https://github.com/AndersTrier/reed-solomon-simd since it seems much faster than our own reed-solomon crate.
each backer is only used for one systematic chunk, to not overload them
We could explore relaxing this restriction somewhat, after we know how things respond to this change in the real networks. Also, this limit of 1 depends somewhat upon the number of validators.
4000% faster as measured on my apple M2 Pro
Just a nit pick, we still need all approval checkers to rerun the encoding to check the encoding merkle root, so while the decoding becomes almost nothing, the overall availability system remains expensive.
I don't really think we overload the backers, but we do try backers as a first option, and if that fails we try systematic chunks. Going back to backers doesn't really make much sense in this setup.
We should still explore switching to https://github.com/AndersTrier/reed-solomon-simd since it seems much faster than our own reed-solomon crate.
noted: https://github.com/paritytech/polkadot-sdk/issues/605#issuecomment-2136756063
I don't really think we overload the backers, but we do try backers as a first option, and if that fails we try systematic chunks. Going back to backers doesn't really make much sense in this setup.
The current strategy is to try backers first if the PoV size is lower than 1 MiB. Otherwise, try systematic recovery (if the feature is enabled). During systematic recovery, if a few validators did not give us the right chunk, we try to retrieve those chunks from the backers as well (we only try each backer once). This way we may still complete systematic recovery if one or a few validators are not responding.
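Sketched roughly, the ordering looks like this (simplified; the enum and names are illustrative, not the actual availability-recovery strategy types):

```rust
// Illustrative sketch of the strategy ordering described above.
const SMALL_POV_LIMIT: usize = 1024 * 1024; // 1 MiB

enum Strategy {
    /// Fetch the full PoV from one of the backers.
    FullFromBackers,
    /// Fetch the ~n/3 systematic chunks; missing ones may be requested from
    /// backers, each backer at most once.
    SystematicChunks,
    /// Regular chunk recovery with reed-solomon decoding.
    RegularChunks,
}

fn build_strategies(pov_size: usize, systematic_enabled: bool) -> Vec<Strategy> {
    let mut strategies = Vec::new();
    if pov_size < SMALL_POV_LIMIT {
        // Small PoVs are cheap to fetch whole from a single backer.
        strategies.push(Strategy::FullFromBackers);
    }
    if systematic_enabled {
        strategies.push(Strategy::SystematicChunks);
    }
    // Regular chunk recovery is always the last-resort fallback.
    strategies.push(Strategy::RegularChunks);
    strategies
}
```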
Just a nit pick, we still need all approval checkers to rerun the encoding to check the encoding merkle root, so while the decoding becomes almost nothing, the overall availability system remains expensive.
Of course. I discovered though that decoding is much slower than encoding. This PR, coupled with some other optimisations I did in our reed-solomon code, still enables us to drop the CPU consumption for erasure coding from ~20% to 5% (as measured on versi, on a network with 50 validators and 10 glutton parachains).
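To make that concrete, a minimal sketch with stand-in types and helpers (not the actual erasure-coding crate API): only the decoding step becomes cheap with systematic recovery, while the re-encoding needed to verify the erasure root remains.

```rust
// Stand-in types and helpers for illustration only.
struct AvailableData;        // stand-in for the recovered PoV + persisted data
type ErasureRoot = [u8; 32]; // stand-in for the merkle root over all chunks

fn reconstruct_from_systematic(_chunks: &[Vec<u8>]) -> AvailableData {
    // Cheap: essentially concatenation, no reed-solomon decoding.
    AvailableData
}
fn obtain_chunks(_n_validators: usize, _data: &AvailableData) -> Vec<Vec<u8>> {
    // Expensive: full reed-solomon encoding of the data into n chunks.
    Vec::new()
}
fn erasure_root(_chunks: &[Vec<u8>]) -> ErasureRoot {
    // Merkle trie root over the chunks.
    [0u8; 32]
}

fn recover_and_check(
    n_validators: usize,
    systematic_chunks: &[Vec<u8>],
    expected_root: ErasureRoot,
) -> Option<AvailableData> {
    let data = reconstruct_from_systematic(systematic_chunks);
    // This step cannot be skipped: without re-encoding and recomputing the
    // root, the checker cannot verify that the encoding behind the committed
    // erasure root was done correctly.
    let chunks = obtain_chunks(n_validators, &data);
    (erasure_root(&chunks) == expected_root).then_some(data)
}
```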