Add availability-recovery from systematic chunks
Don't look at the commit history, it's confusing, as this branch is based on another branch that was merged
Fixes #598. Also implements RFC #47.
Description
- Availability-recovery now first attempts to request the systematic chunks for large PoVs (the first ~n/3 chunks, from which the full data can be recovered without the costly reed-solomon decoding), falling back to recovering from all chunks if that process fails for any reason. Additionally, backers are used as a backup for requesting the systematic chunks if the assigned validator is not offering the chunk (each backer is only used for one systematic chunk, so as not to overload them).
- Quite obviously, recovering from systematic chunks is much faster than recovering from regular chunks (4000% faster, as measured on my Apple M2 Pro).
- Introduces a ValidatorIndex -> ChunkIndex mapping which is different for every core, in order to avoid only querying the first n/3 validators over and over again in the same session. The mapping is the one described in RFC 47 (see the illustrative sketch after this list).
- The mapping is feature-gated by the NodeFeatures runtime API so that it can only be enabled via a governance call once a sufficient majority of validators have upgraded their client. If the feature is not enabled, the mapping will be the identity mapping and backwards-compatibility will be preserved.
- Adds a new chunk request protocol version (v2), which adds the ChunkIndex to the response. This may or may not be checked against the expected chunk index: for av-distribution and systematic recovery it will be checked, but not for regular recovery. This is backwards compatible: a v2 request is attempted first, and if that fails during protocol negotiation, v1 is used.
- Systematic recovery is only attempted during approval-voting, where we have easy access to the core_index. For disputes and collator pov_recovery, regular chunk requests are used, just as before.
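For intuition, here is a rough sketch of the kind of per-core rotation such a mapping performs. This is an illustration only, not the exact RFC 47 formula (consult the RFC for the authoritative definition); the function name and threshold computation are simplified.

```rust
/// Illustrative only: a per-core rotation of the ValidatorIndex -> ChunkIndex
/// assignment, so that the systematic (first ~n/3) chunks land on different
/// validators for different cores. The real mapping is the one defined in RFC 47.
fn chunk_index_for(
    node_features_enabled: bool, // the governance-enabled NodeFeatures bit
    n_validators: u32,
    core_index: u32,
    validator_index: u32,
) -> u32 {
    if !node_features_enabled {
        // Backwards-compatible identity mapping.
        return validator_index;
    }
    // Number of chunks needed for systematic recovery, roughly n/3.
    let threshold = n_validators.saturating_sub(1) / 3 + 1;
    // Shift the assignment by a per-core offset so that the systematic-chunk
    // load rotates across the validator set instead of always hitting the
    // same first ~n/3 validators.
    (validator_index + core_index * threshold) % n_validators
}
```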
Performance results
Some results from subsystem-bench:
with regular chunk recovery: CPU usage per block 39.82s
with recovery from backers: CPU usage per block 16.03s
with systematic recovery: CPU usage per block 19.07s
End-to-end results here: https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099
TODO:
- [x] RFC #47
- [x] merge https://github.com/paritytech/polkadot-sdk/pull/2177 and rebase on top of those changes
- [x] merge https://github.com/paritytech/polkadot-sdk/pull/2771 and rebase
- [x] add tests
- [x] preliminary performance measure on Versi: see https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099
- [x] Rewrite the implementer's guide documentation
- [x] https://github.com/paritytech/polkadot-sdk/pull/3065
- [x] https://github.com/paritytech/zombienet/issues/1705 and fix zombienet tests
- [x] security audit
- [ ] implement https://github.com/paritytech/polkadot-sdk/issues/3127 as a stacked PR
- [ ] final versi test and performance measure
Need to rebase on top of https://github.com/paritytech/polkadot-sdk/pull/1457, it currently contains changes from both PRs
merged latest master into the branch and the diff looks good now.
Note that the commit history will look confusing, and appears to contain commits from both this PR and the previous one (#1457). This is because this branch contains merge commits from two branches and has a complicated history. But the diff is fine so I won't squash and force push it.
In the current state, this PR checks the runtime to see if the shuffling algorithm is enabled, but it's up to the client code to perform the actual shuffling.
I think a better way of doing this would be to embed the shuffling algorithm into the runtime and expose a runtime API for retrieving the mapping.
This way, further changes to the algorithm would be encapsulated into the runtime upgrade and it'd provide an easy way of using the shuffle in alternative clients without reimplementing it. Otherwise, reimplementing the shuffle in other languages may be tricky, cause duplicated effort and be error-prone.
WDYT?
@sandreim suggested opening an RFC for these details as they are quite important. I think I'll work on it
I think a better way of doing this would be to embed the shuffling algorithm into the runtime and expose a runtime API for retrieving the mapping.
What part of the shuffling algorithm are you thinking of embedding into the runtime? If that's the fn get_shuffle that will be called once per block, then it's probably fine. However, I imagine the other part that makes the ParaId-specific shuffle is the one most likely to change with the upcoming CoreJam model. We'd need to base it on CoreId most likely, or something else. So we're back to square one. Although it's a good point about reimplementing get_shuffle in other languages.
OTOH, if we keep it offchain, we can do optimizations like precomputing the shuffling per session once we have the randomness and client features are expected to be on.
@sandreim suggested opening an RFC for these details as they are quite important. I think I'll work on it
SGTM. We should figure out the details of the shuffling before that.
I think a better way of doing this would be to embed the shuffling algorithm into the runtime and expose a runtime API for retrieving the mapping.
What part of the shuffling algorithm are you thinking of embedding into the runtime? If that's the fn get_shuffle that will be called once per block, then it's probably fine. However, I imagine the other part that makes the ParaId-specific shuffle is the one most likely to change with the upcoming CoreJam model. We'd need to base it on CoreId most likely, or something else. So we're back to square one. Although it's a good point about reimplementing get_shuffle in other languages. OTOH, if we keep it offchain, we can do optimizations like precomputing the shuffling per session once we have the randomness and client features are expected to be on.
Keeping it offchain sounds like a good idea to me. For performance considerations, we should keep stuff that can be done offchain out of the runtime, unless there is a good reason not to.
Great work @alindima 🚀
What part of the shuffling algorithm are you thinking of embedding into the runtime?
I think the benefit is highest when we embed all of it into the runtime. By benefit I mean being able to atomically change the underlying shuffling algorithm. We could expose a couple of runtime APIs like:
fn get_availability_chunk_map(block_hash, para_id) -> Vec<ChunkIndex>
// The second vector is used to get the para-specific offset into the shuffle.
fn get_availability_chunk_para_maps(block_hash) -> (Vec<ChunkIndex>, Vec<ChunkIndex>)
Subsystems that need the para-specific map (availability-distribution) would just use the vector of get_availability_chunk_para_maps.
We could cache it in the various subsystems that use it, like we do in the current state of the PR anyway. availability-distribution would call it once per relay chain block. bitfield-signing and availability-recovery would also call it once per relay chain block.
Of course, disputes could also trigger availability-recovery.
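For concreteness, a rough sketch of how the two calls above might be declared (the trait name is hypothetical and the imports are assumed; with sp_api::decl_runtime_apis! the block hash is supplied at the call site rather than as an explicit parameter):

```rust
// Hypothetical sketch, not an existing polkadot-sdk API. Assumes ParaId and
// ChunkIndex are available from the primitives crate.
use polkadot_primitives::{ChunkIndex, Id as ParaId};

sp_api::decl_runtime_apis! {
    pub trait AvailabilityChunkMapping {
        /// Chunk mapping for the given para at the block this is invoked at.
        fn availability_chunk_map(para_id: ParaId) -> Vec<ChunkIndex>;
        /// Base shuffle plus the per-para offsets used to derive it.
        fn availability_chunk_para_maps() -> (Vec<ChunkIndex>, Vec<ChunkIndex>);
    }
}
```

Node-side, a subsystem would then call these once per relay chain block through the runtime API client and cache the result, as described above.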
However, I imagine the other part that makes the ParaId-specific shuffle is the one most likely to change with the upcoming CoreJam model. We'd need to base it on CoreId most likely, or something else.
Indeed this would have to change, as paraIds will no longer exist. I can't tell right now which ID will be the best option, but I think you're right that CoreId is probably the right one once CoreJam lands. Or could it be WorkClass?
I assume CoreJam implementation will add some breaking changes in the consensus protocol that will require all validators to upgrade to latest client version anyway.
Nevertheless, my main argument for moving the shuffling algorithm on-chain is that it'd be easy to modify. Otherwise, we'd again need to coordinate an atomic modification of the client behaviour of all validators. AFAICT, the runtime is the best place for such coordination.
I am still not sure what's the best option right now
This pull request has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/parachain-consensus-updates-coretime-asynchronous-backing-scalability/4396/1
Discovered an issue that needs fixing: pov-recovery in cumulus does not work if the collator uses an RPC relay chain node.
It fails because the partial relay chain node built in cumulus does not have access to the ChainApiSubsystem to query the block number of a relay chain block by its hash.
There are a couple of ways I can fix this:
- pass the block number in as a (maybe optional) param in the RecoverAvailableData payload and don't query the chain api if it's supplied.
- add an RPC method for the chain api's block_number and use it in cumulus.
- make the ChainApiSubsystem generic over a BlockNumberProvider that uses the DB for full nodes and relies on RPC for cumulus partial nodes (see the sketch below).
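A rough sketch of that third option; the trait and struct names are illustrative, not actual polkadot-sdk interfaces:

```rust
// Hypothetical sketch of the third option; not actual polkadot-sdk interfaces.
use polkadot_primitives::{BlockNumber, Hash};

/// Abstracts "relay chain block hash -> block number" so that the
/// ChainApiSubsystem does not need direct database access.
trait BlockNumberProvider {
    fn block_number(&self, hash: Hash) -> Option<BlockNumber>;
}

/// Full nodes: answer from the local backend/database.
struct DbBlockNumberProvider;

impl BlockNumberProvider for DbBlockNumberProvider {
    fn block_number(&self, _hash: Hash) -> Option<BlockNumber> {
        // Would look the header up in the local database here.
        None
    }
}

/// Cumulus minimal relay chain nodes: answer via the relay chain RPC client.
struct RpcBlockNumberProvider;

impl BlockNumberProvider for RpcBlockNumberProvider {
    fn block_number(&self, _hash: Hash) -> Option<BlockNumber> {
        // Would issue a `chain_getHeader`-style RPC call here.
        None
    }
}
```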
I'll punt this for now, as the most important thing that needs consensus at this point is the right shuffling algorithm
This pull request has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/polkadot-da-vs-competition/3403/5
Tested in versi, here are the performance results: https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099
Discovered an issue that needs fixing: pov-recovery in cumulus does not work if the collator uses an RPC relay chain node. It fails because the partial relay chain node built in cumulus does not have access to the ChainApiSubsystem to query the block number of a relay chain block by its hash.
Fixed the cumulus pov_recovery issue
Opened an RFC for discussing the details of the chunk index assignment function: https://github.com/polkadot-fellows/RFCs/pull/47
I had to do a little force push because I messed up a merge commit. That's also why the CI bot requested reviews from a bunch of groups that were not really needed
@ordian @alexggh @sandreim This is ready for another review, I implemented the updated proposal from the RFC. I only have to implement some zombienet tests and adapt subsystem-bench for it, but since it's a large PR, you can get started reviewing it
Some results from subsystem-bench:
with regular chunk recovery: CPU usage per block 39.82s
with recovery from backers: CPU usage per block 16.03s
with systematic recovery: CPU usage per block 19.07s
I just thought of another improvement to this PR (which I'd like to do as a follow-up, because this PR is already huge).
We can cache the req-response protocol version used for each validator, to avoid always trying v2 first for nodes that haven't been upgraded. We then listen for PeerConnected/PeerDisconnected messages from the network bridge and purge the protocol version from the cache for the validator. Also purge the entire cache on session change just to be safe.
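A rough sketch of what such a cache could look like (types and names are hypothetical, not existing polkadot-sdk interfaces; the cache could be keyed by peer id or by validator/authority id, whichever the requester has at hand):

```rust
// Hypothetical sketch of the follow-up idea, not existing polkadot-sdk types.
use std::collections::HashMap;
use std::hash::Hash;

#[derive(Clone, Copy)]
enum ChunkReqProtocol {
    V1,
    V2,
}

struct ProtocolVersionCache<Peer: Hash + Eq> {
    by_peer: HashMap<Peer, ChunkReqProtocol>,
}

impl<Peer: Hash + Eq> ProtocolVersionCache<Peer> {
    fn new() -> Self {
        Self { by_peer: HashMap::new() }
    }

    /// Use the cached version if we have one; otherwise optimistically try v2.
    fn preferred(&self, peer: &Peer) -> ChunkReqProtocol {
        self.by_peer.get(peer).copied().unwrap_or(ChunkReqProtocol::V2)
    }

    /// Remember which version succeeded after protocol negotiation.
    fn record(&mut self, peer: Peer, version: ChunkReqProtocol) {
        self.by_peer.insert(peer, version);
    }

    /// On `PeerDisconnected` from the network bridge: the peer may come back
    /// upgraded, so drop what we knew about it.
    fn on_disconnect(&mut self, peer: &Peer) {
        self.by_peer.remove(peer);
    }

    /// On session change, purge everything just to be safe.
    fn on_session_change(&mut self) {
        self.by_peer.clear();
    }
}
```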
I'll open a separate issue for this
This pull request has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/update-validator-set-size-increase-on-kusama/8218/1
I've conducted the final versi testing before merging this and everything looks good! When enabled, total recovery time is halved and the CPU consumption associated with erasure coding is on average more than halved (from ~13% to ~5%).
I'll do a final pass through it, polish some metrics and logs and merge it soon
The CI pipeline was cancelled due to failure of one of the required jobs. Job name: cargo-clippy Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6283397
We should still explore switching to https://github.com/AndersTrier/reed-solomon-simd since it seems much faster than our own reed-solomon crate.
each backer is only used for one systematic chunk, to not overload them
We could explore relaxing this restriction somewhat, after we know how things respond to this change in the real networks. Also, this limit of 1 depends somewhat upon the number of validators.
4000% faster as measured on my apple M2 Pro
Just a nit pick, we still need all approval checkers to rerun the encoding to check the encoding merkle root, so while the decoding becomes almost nothing, the overall availability system remains expensive.
I don't really think we overload the backers, but we do try backers as a first option, and if that fails we try systematic chunks. Going back to backers doesn't really make much sense in this setup.
We should still explore switching to https://github.com/AndersTrier/reed-solomon-simd since it seems much faster than our own reed-solomon crate.
noted: https://github.com/paritytech/polkadot-sdk/issues/605#issuecomment-2136756063
I don't really think we overload the backers, but we do try backers as a first option, and if that fails we try systematic chunks. Going back to backers doesn't really make much sense in this setup.
The current strategy is to try backers first if the PoV size is lower than 1 MiB. Otherwise, try systematic recovery (if the feature is enabled). During systematic recovery, if a few validators did not give us the right chunk, we try to retrieve those chunks from the backers as well (we only try each backer once). This way we may still complete systematic recovery if one or a few validators are not responding.
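Sketched roughly, the ordering looks like this (simplified; the enum and names are illustrative, not the actual availability-recovery strategy types):

```rust
// Illustrative sketch of the strategy ordering described above.
const SMALL_POV_LIMIT: usize = 1024 * 1024; // 1 MiB

enum Strategy {
    /// Fetch the full PoV from one of the backers.
    FullFromBackers,
    /// Fetch the ~n/3 systematic chunks; missing ones may be requested from
    /// backers, each backer at most once.
    SystematicChunks,
    /// Regular chunk recovery with reed-solomon decoding.
    RegularChunks,
}

fn build_strategies(pov_size: usize, systematic_enabled: bool) -> Vec<Strategy> {
    let mut strategies = Vec::new();
    if pov_size < SMALL_POV_LIMIT {
        // Small PoVs are cheap to fetch whole from a single backer.
        strategies.push(Strategy::FullFromBackers);
    }
    if systematic_enabled {
        strategies.push(Strategy::SystematicChunks);
    }
    // Regular chunk recovery is always the last-resort fallback.
    strategies.push(Strategy::RegularChunks);
    strategies
}
```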
Just a nit pick, we still need all approval checkers to rerun the encoding to check the encoding merkle root, so while the decoding becomes almost nothing, the overall availability system remains expensive.
Of course. I discovered though that decoding is much slower than encoding. This PR, coupled with some other optimisations I did in our reed-solomon code, still enables us to drop the CPU consumption for erasure coding from ~20% to 5% (as measured on versi, on a network with 50 validators and 10 glutton parachains).
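To make that concrete, a minimal sketch with stand-in types and helpers (not the actual erasure-coding crate API): only the decoding step becomes cheap with systematic recovery, while the re-encoding needed to verify the erasure root remains.

```rust
// Stand-in types and helpers for illustration only.
struct AvailableData;        // stand-in for the recovered PoV + persisted data
type ErasureRoot = [u8; 32]; // stand-in for the merkle root over all chunks

fn reconstruct_from_systematic(_chunks: &[Vec<u8>]) -> AvailableData {
    // Cheap: essentially concatenation, no reed-solomon decoding.
    AvailableData
}
fn obtain_chunks(_n_validators: usize, _data: &AvailableData) -> Vec<Vec<u8>> {
    // Expensive: full reed-solomon encoding of the data into n chunks.
    Vec::new()
}
fn erasure_root(_chunks: &[Vec<u8>]) -> ErasureRoot {
    // Merkle trie root over the chunks.
    [0u8; 32]
}

fn recover_and_check(
    n_validators: usize,
    systematic_chunks: &[Vec<u8>],
    expected_root: ErasureRoot,
) -> Option<AvailableData> {
    let data = reconstruct_from_systematic(systematic_chunks);
    // This step cannot be skipped: without re-encoding and recomputing the
    // root, the checker cannot verify that the encoding behind the committed
    // erasure root was done correctly.
    let chunks = obtain_chunks(n_validators, &data);
    (erasure_root(&chunks) == expected_root).then_some(data)
}
```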