nimbus-eth2
Implement further BLS batching and data parallelism
- [ ] When syncing, we can apply BLS batching for a sequence of blocks.
- [x] Incoming attestations can be grouped together and verified as a batch (a batch can be considered complete either by reaching target size or by the elapsing of a deadline timer).
- [x] Use a thread pool to exploit multi-core parallelism.
Batching opportunities overview
Recap
aggregate != batching
An aggregate is spec-defined; for a receiver:
- the signature is received aggregated from the network
- the attesters' public keys must be aggregated (on BLS G1)
- only attestations are aggregated
Batching is a client-side optimization:
- one MillerLoop(PublicKey, Message, Signature) per signature set
- a single FinalExponentiation over the accumulated partial Miller loops
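To make the contrast concrete, here is a sketch of the two checks (pubkeys on G1, signatures and hashed messages on G2; the random blinding scalars $r_i$ are an assumption about the exact batching scheme, not something stated above). Aggregate verification, for one message $m$ signed by several attesters:

$$e\Big(\sum_i pk_i,\; H(m)\Big) \stackrel{?}{=} e(g_1,\ \sigma_{agg})$$

Batch verification, for independent sets $(pk_i, m_i, \sigma_i)$:

$$\mathrm{FinalExp}\Big(\prod_i \mathrm{MillerLoop}\big(pk_i,\ r_i \cdot H(m_i)\big)\ \cdot\ \mathrm{MillerLoop}\Big({-g_1},\ \sum_i r_i\,\sigma_i\Big)\Big) \stackrel{?}{=} 1_{G_T}$$

i.e. one Miller loop per signature set, but a single shared final exponentiation, which is what makes batching cheaper than verifying each signature separately.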
According to Justin Drake, there is an upcoming optimization that will make adding public keys 50% faster in BLST, and that this is the bottleneck with a large number of signatures; but in my benchmarks it accounts for only 5% of aggregate verification and it's unused for batching. Is he talking about something else?
validation != verification
Validation is according to gossipsub rebroadcasting rules
- block
- attestations
- voluntary exits, attester slashing, block proposer slashing
Validation requires verifying 1 signature per object
Verification is according to consensus rules
- for blocks:
- state transition can be done using the block
- crypto verification, including all nested crypto objects (attestations, exits, slashings, randao)
- crypto verification for blocks is batched today.
- for attestations:
- 1 signature to verify
- small consistency checks which seem redundant with gossip checks
- voluntary exits, attester slashing, block proposer slashing:
- 1 signature to verify
- small consistency checks which seem redundant with gossip checks
Flow
For attestations, exits, slashings (during steady state):
- Gossip -> eth2_processor -> validation + rebroadcast -> a "pool"
For blocks during sync:
- Eth2 RPC -> sync_manager -> SharedBlockQueue -> clearance -> verification -> Candidate ChainDAG
For blocks during steady state:
- Gossip -> eth2_processor -> validation + rebroadcast -> SharedBlockQueue -> clearance -> verification -> Candidate ChainDAG
During sync or steady state, for missing ancestors:
- Eth2 RPC -> request_manager -> SharedBlockQueue -> clearance -> verification -> Candidate ChainDAG
Bottlenecks
During sync:
- block processing speed
- no mesh or latency expectations from connected peers
During steady state:
- attestation processing speed (may be unaggregated!)
- latency expectations from connected peers
Batching opportunities
For blocks:
- we can batch during steady state, for validation
- we can batch during steady state or sync, for verification
- we can batch missing ancestors during steady state or sync, for verification
Analysis
In a well-functioning chain, we rarely receive more than one block per slot in steady state.
Our opportunity to receive many blocks to verify at once is only during sync or when we are missing a fork.
So it only makes sense to batch the "SharedBlockQueue -> clearance -> verification".
For example, by changing the SharedBlockQueue from AsyncQueue[BlockEntry] to AsyncQueue[seq[BlockEntry]], so that inputs from the same sync request or ancestors request are grouped together.
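A minimal sketch of that change, assuming a Chronos AsyncQueue and a simplified stand-in for BlockEntry (the real type and the enqueue/clearance procs in nimbus-eth2 look different; names here are illustrative):

```nim
import chronos

type
  BlockEntry = object
    ## simplified stand-in for the real nimbus-eth2 BlockEntry
    blck: seq[byte]

  # was: AsyncQueue[BlockEntry] - one element now carries all blocks
  # obtained from a single sync request or ancestors request
  SharedBlockQueue = AsyncQueue[seq[BlockEntry]]

proc enqueueFromSync(q: SharedBlockQueue,
                     blocks: seq[BlockEntry]) {.async.} =
  ## blocks from one Eth2 RPC response are pushed as a single group
  await q.addLast(blocks)

proc clearanceLoop(q: SharedBlockQueue) {.async.} =
  while true:
    let group = await q.popFirst()
    # clearance/verification can now batch-verify every signature in
    # `group` before applying the blocks in order
    discard group
```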
One issue is that verifying blocks is expensive and we may not have enough time to verify 5 missing blocks at once during periods of non-finality without impacting validator duties or gossipsub connectivity requirements.
For attestations:
- we can batch 'eth2_processor -> validation + rebroadcast -> a "pool"' during steady state
- no attestations during sync
Analysis
We kind of already process multiple at once (without returning control to the event loop) because validation is somewhat cheap (1 signature only, no state transition), so the only thing missing is using the crypto batchVerify.
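A rough sketch of what the size-or-deadline batching trigger could look like on top of Chronos (illustrative names only, not the eventual nimbus-eth2 API; the actual crypto batchVerify call is left as a placeholder):

```nim
import chronos

const
  BatchTargetSize = 64              # assumed target batch size
  BatchDeadline = 10.milliseconds   # assumed deadline for partial batches

type
  PendingSig = ref object
    ## one (pubkey, message, signature) set awaiting batch verification
    done: Future[bool]

var batch: seq[PendingSig]

proc flushBatch() =
  ## run the crypto batchVerify over everything queued (placeholder here),
  ## then wake up every waiting validation task with the result
  let items = batch
  batch = @[]
  let ok = true                     # placeholder for batchVerify(items)
  for item in items:
    item.done.complete(ok)

proc deadlineWatcher() {.async.} =
  await sleepAsync(BatchDeadline)
  if batch.len > 0:
    flushBatch()

proc scheduleForBatch(): Future[bool] =
  ## called from attestation validation instead of verifying immediately
  let item = PendingSig(done: newFuture[bool]("batch.pending"))
  batch.add item
  if batch.len >= BatchTargetSize:
    flushBatch()                    # target size reached: verify right away
  elif batch.len == 1:
    asyncSpawn deadlineWatcher()    # first item: arm the deadline timer
  item.done
```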
Architecture exploration
If we want to batch blocks, we need to avoid blocking the networking and validator duties (so DAG, forkchoice as well).
- solution 1 requires a threadpool and a way to asyncify waiting for a Flowvar
- solution 2 would split NBC into a networking thread and a verification thread, with a way to asyncify waiting for a Channel or AsyncChannel
- solution 3 would split NBC into a networking process and a verification process, communicating via IPC, with 2 event loops
- solution 4 would split NBC into producer<->consumer services, where at least one service will be networking and one will be consensus.
Whatever solution we use, we need a way to tie the communication primitives (Channels, Flowvar) into Chronos. AsyncChannel seems stalled due to transporting GC-ed types on thread-local heaps. ARC/ORC is out of the question for the moment.
I've outlined a way to make both Flowvars (including ones from custom threadpools) and Channels (including Nim channels) work with Chronos, by creating an "async condition variable" (AsyncCV): https://github.com/nim-lang/RFCs/issues/304#issuecomment-789494498 So an AsyncChannel or an AsyncFlowvar is just "Channel + AsyncCV" or "Flowvar + AsyncCV".
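For illustration only, here is a toy, single-threaded stand-in for such an AsyncCV (the names and the composition come from my reading of the RFC comment, not from an existing chronos API; the real primitive additionally needs a thread-safe way to wake the event loop, which is exactly the missing piece):

```nim
import chronos

type
  AsyncCV = ref object
    ## toy stand-in: the real AsyncCV must be signalable from another thread
    waiters: seq[Future[void]]

proc wait(cv: AsyncCV): Future[void] =
  result = newFuture[void]("asynccv.wait")
  cv.waiters.add result

proc signal(cv: AsyncCV) =
  ## in the real design, this would be triggered (indirectly) by the worker
  ## thread that completed the Flowvar or pushed to the Channel
  for fut in cv.waiters:
    if not fut.finished:
      fut.complete()
  cv.waiters.setLen(0)

# Composition idea: an "AsyncFlowvar"/"AsyncChannel" pairs the existing
# container with an AsyncCV - the consumer loops `await cv.wait()` until
# `tryRecv`/`isReady` succeeds, and the producer calls `signal` after
# publishing its result.
```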
@cheatfate's implementation of AsyncChannels is already based on a primitive called AsyncThreadEvent. It seems to me that this is just another name for the condition variables you have in mind.
Attestation batching is, I think, perhaps the easiest and most impactful thing we can do, followed by sync blocks. Fork choice reorgs, not so much, because there are still not a lot of them being created, even in forky situations.
One more thing that would be interesting about attestations is if we could somehow batch aggregating them - i.e. right now, when an attestation arrives, after it's been verified we'll look for an existing attestation to aggregate it with - if we could somehow batch verification and aggregation, that would be pretty nice.
> @cheatfate's implementation of AsyncChannels is already based on a primitive called AsyncThreadEvent. It seems to me that this is just another name for the condition variables you have in mind.
Yes, AsyncThreadEvent can be used to build the AsyncConditionVariable or AsyncSemaphores. In particular, they would allow awaiting a Flowvar from an arbitrary threadpool.
I think it's worth splitting them out of the stalled AsyncChannel PR.
The main issue with AsyncThreadEvent is that there is no "2-phase commit" protocol to ensure usage without deadlocks:
- The event loop creates a new ATE
- The event loop sends the work and the ATE to the threadpool
- The threadpool is fast, returns the work and fires the ATE
- The event loop waits on the ATE
- The event loop is "deadlocked" because the ATE was already fired
But the base primitive can indeed be used as a replacement for the wait/signal on my futexes: https://github.com/mratsim/weave/blob/30baced/weave/cross_thread_com/event_notifiers.nim#L67-L148 so that we have something that has been verified not to deadlock.
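A hedged sketch of that 2-phase sleep protocol in plain Nim (illustrative, loosely in the spirit of the linked event_notifiers.nim rather than a drop-in for it): the consumer publishes its intent to sleep, re-checks for work, and only then commits to sleeping, while the producer latches its signal so a wake-up that races with the wait cannot be lost.

```nim
import std/locks

type
  EventNotifier = object
    lock: Lock
    cond: Cond
    intentToSleep: bool   # consumer: "I am about to sleep"
    signaled: bool        # producer: latched wake-up

proc init(en: var EventNotifier) =
  initLock en.lock
  initCond en.cond

proc intendToSleep(en: var EventNotifier) =
  ## phase 1: announce the intent to sleep *before* the final re-check
  ## of the work source (queue, Flowvar, ...)
  withLock en.lock:
    en.intentToSleep = true

proc park(en: var EventNotifier) =
  ## phase 2: commit to sleeping, unless a signal already arrived
  withLock en.lock:
    while en.intentToSleep and not en.signaled:
      wait(en.cond, en.lock)
    en.intentToSleep = false
    en.signaled = false

proc notify(en: var EventNotifier) =
  ## producer: latch the signal so it is not lost if it wins the race
  ## against the consumer's park()
  withLock en.lock:
    en.signaled = true
    signal(en.cond)
```

The consumer sequence is intendToSleep -> re-check the work source -> park, and the producer sequence is publish work -> notify; the latched `signaled` flag is what prevents the lost wake-up described in the scenario above.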
It's possible that the issues the desktop team is facing are in part due to that. It took me weeks, before resorting to formal verification, to write the protocol properly. And even published papers in the past made errors: http://research.microsoft.com/en-us/um/people/lamport/tla/dcas.pdf
Attestation batching is done in #2439.