Implement further BLS batching and data parallelism

  • [ ] When syncing, we can apply BLS batching to a sequence of blocks.
  • [x] Incoming attestations can be grouped together and verified as a batch (a batch is considered complete either when it reaches its target size or when a deadline timer elapses).
  • [x] Use a thread pool to exploit multi-core parallelism.

zah avatar Feb 17 '21 20:02 zah

Batching opportunities overview

Recap

aggregate != batching

An aggregate is spec-defined; for a receiver:

  • the signature arrives already aggregated from the network
  • the attesters' public keys must be aggregated (on BLS G1)
  • only attestations are aggregated

Batching is a client-side optimization (the check is written out after this list):

  • MillerLoop(PublicKey, Message, Signature)
  • FinalExponentiation(partial Miller loops)
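
For reference, this is the standard random-linear-combination batch check (my notation, not code from the repo): with random blinding scalars $r_i$, public keys $pk_i$ on G1, and signatures $\sigma_i$ and message hashes $H(m_i)$ on G2 (Eth2's ciphersuite), all triples are verified at once via

$$e\Big(g_1,\ \sum_i r_i\,\sigma_i\Big) \;=\; \prod_i e\big(r_i \cdot pk_i,\ H(m_i)\big)$$

Each pairing $e$ is FinalExp(MillerLoop(·, ·)), so the whole check collapses into a single final exponentiation over the product of the partial Miller loops, which is exactly the split above.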

According to Justin Drake, there is an incoming optimization that will make adding public keys 50% faster in BLST, and he says this is the bottleneck with a large number of signatures; but in my benchmarks public-key aggregation is only 5% of aggregate verification, and it's unused for batching. Is he talking about something else?

validation != verification

Validation is done according to the gossipsub rebroadcasting rules, for:

  • blocks
  • attestations
  • voluntary exits, attester slashings, proposer slashings

Validation requires verifying 1 signature per object

Verification is done according to the consensus rules:

  • for blocks:
    • state transition can be done using the block
    • crypto verification, including all nested crypto objects (attestations, exits, slashings, randao)
      • crypto verification for blocks is batched today.
  • for attestations:
    • 1 signature to verify
    • small consistency checks, which seem redundant with the gossip checks
  • for voluntary exits, attester slashings, proposer slashings:
    • 1 signature to verify
    • small consistency checks, which seem redundant with the gossip checks

Flow

For attestations, exits, and slashings (during steady state):

  • Gossip -> eth2_processor -> validation + rebroadcast -> a "pool"

For blocks during sync:

  • Eth2 RPC -> sync_manager -> SharedBlockQueue -> clearance -> verification -> Candidate ChainDAG

For blocks during steady state:

  • Gossip -> eth2_processor -> validation + rebroadcast -> SharedBlockQueue -> clearance -> verification -> Candidate ChainDAG

During sync or steady state, for missing ancestors:

  • Eth2 RPC -> request_manager -> SharedBlockQueue -> clearance -> verification -> Candidate ChainDAG

Bottlenecks

During sync:

  • block processing speed
  • no mesh or latency expectations from connected peers

During steady state:

  • attestation processing speed (may be unaggregated!)
  • latency expectations from connected peers

Batching opportunities

For blocks:

  • we can batch during steady state, for validation
  • we can batch during steady state or sync, for verification
  • we can batch missing ancestors during steady state or sync, for verification

Analysis

In a well-functioning chain we rarely receive more than 1 block per slot during steady state. Our only opportunity to receive many blocks to verify at once is during sync, or when we are missing a fork. So it only makes sense to batch the "SharedBlockQueue -> clearance -> verification" path, for example by changing the SharedBlockQueue from AsyncQueue[BlockEntry] to AsyncQueue[seq[BlockEntry]], so that inputs from the same sync request or ancestors request stay grouped together (see the sketch below).
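
A minimal sketch of that type change, assuming chronos's AsyncQueue; BlockEntry here is a simplified stand-in for the real nimbus-eth2 type, and verifyBatch is hypothetical:

```nim
import chronos

type
  BlockEntry = object          # simplified stand-in for nimbus-eth2's BlockEntry
    slot: uint64
    signedBlock: seq[byte]     # placeholder for the signed beacon block

  # AsyncQueue[BlockEntry] delivers blocks one by one, losing the natural
  # grouping of a sync request; AsyncQueue[seq[BlockEntry]] keeps blocks
  # fetched by the same sync/ancestors request together, so clearance can
  # hand the whole group to batched signature verification.
  SharedBlockQueue = AsyncQueue[seq[BlockEntry]]

proc verifyBatch(batch: seq[BlockEntry]) =
  # Hypothetical: accumulate one Miller loop per signature across the whole
  # group, then perform a single final exponentiation.
  discard

proc clearanceLoop(queue: SharedBlockQueue) {.async.} =
  while true:
    let batch = await queue.popFirst()   # one sync request's worth of blocks
    verifyBatch(batch)
```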

One issue is that verifying blocks is expensive and we may not have enough time to verify 5 missing blocks at once during periods of non-finality without impacting validator duties or gossipsub connectivity requirements.

For attestations:

  • we can batch 'eth2_processor -> validation + rebroadcast -> a "pool"' during steady state
  • no attestations during sync

Analysis

We already more or less process multiple attestations at once (without returning control to the event loop), because validation is relatively cheap (1 signature only, no state transition), so the only thing missing is using the crypto-level batchVerify (see the sketch below).
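
A sketch of the size-or-deadline batcher described in the opening checklist, using chronos; TargetBatchSize, BatchDeadline, SigSet and batchVerify are illustrative names, not the real nimbus-eth2 ones:

```nim
import chronos

const
  TargetBatchSize = 64              # illustrative target batch size
  BatchDeadline = 10.milliseconds   # illustrative deadline

type
  SigSet = object   # stand-in for a (public key, message, signature) triple

proc batchVerify(batch: seq[SigSet]): bool =
  # Placeholder: one Miller loop per triple, one shared final exponentiation.
  # On failure, fall back to verifying individually to find the culprits.
  true

proc batcher(queue: AsyncQueue[SigSet]) {.async.} =
  while true:
    var batch = @[await queue.popFirst()]   # block until the first item arrives
    let deadline = sleepAsync(BatchDeadline)
    while batch.len < TargetBatchSize and not deadline.finished:
      let next = queue.popFirst()
      await next or deadline                # whichever completes first
      if next.finished:
        batch.add next.read()
      else:
        next.cancel()                       # deadline fired: stop collecting
        break
    discard batchVerify(batch)              # accept or reject the whole batch
```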

Architecture exploration

If we want to batch blocks, we need to avoid blocking the networking and validator duties (so DAG, forkchoice as well).

  • solution 1 requires a threadpool and a way to asyncify waiting for a Flowvar
  • solution 2 would split NBC into a networking thread and a verification thread, plus a way to asyncify waiting on a Channel or AsyncChannel
  • solution 3 would split NBC into a networking process and a verification process, communicating via IPC, with 2 event loops
  • solution 4 would split NBC into producer <-> consumer services; at least one service would be networking and one consensus

Whatever solution we choose, we need a way to tie the communication primitives (Channels, Flowvar) into Chronos. AsyncChannel seems stalled because it transports GC-ed types allocated on thread-local heaps. ARC/ORC is out of the question for the moment.

I've outlined a way to make both Flowvar (including ones from custom threadpools) and Channels (including Nim channels) work with Chronos, by creating an "async condition variable" (AsyncCV): https://github.com/nim-lang/RFCs/issues/304#issuecomment-789494498 An AsyncChannel or an AsyncFlowvar is then just "Channel + AsyncCV" or "Flowvar + AsyncCV" (sketched below).
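
A rough sketch of the "Channel + AsyncCV" composition; AsyncCV is the hypothetical primitive from the linked RFC comment (fire() callable from any thread, wait() awaitable on the Chronos loop), shown here only as stubs:

```nim
import chronos

# Hypothetical async condition variable: fire() may be called from any
# thread; wait() returns a Future that the Chronos event loop can await.
type AsyncCV = object

proc fire(cv: var AsyncCV) =
  discard                           # hypothetical stub

proc wait(cv: var AsyncCV): Future[void] =
  newFuture[void]("AsyncCV.wait")   # hypothetical stub

type AsyncFlowChannel[T] = object
  chan: Channel[T]                  # Nim's thread-safe system channel
  cv: AsyncCV

proc send[T](ac: ptr AsyncFlowChannel[T], item: T) =
  ## Producer side, callable from a worker thread.
  ac.chan.send(item)
  fire(ac.cv)                       # wake the event loop

proc recv[T](ac: ptr AsyncFlowChannel[T]): Future[T] {.async.} =
  ## Consumer side, run on the event-loop thread.
  while true:
    let (ok, item) = ac.chan.tryRecv()
    if ok:
      return item
    await wait(ac.cv)               # park until a producer fires
```

The object is passed as a ptr (e.g. allocated with createShared) precisely to keep it off the thread-local GC heap, which is the problem that stalled the AsyncChannel PR.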

mratsim avatar Mar 03 '21 17:03 mratsim

@cheatfate's implementation of AsyncChannels is already based on a primitive called AsyncThreadEvent. It seems to me that this is just another name for the condition variables you have in mind.

zah avatar Mar 03 '21 17:03 zah

Attestation batching, I think, is perhaps the easiest and most impactful thing we can do, followed by sync blocks. Fork choice reorgs, not so much, because there still aren't a lot of them being created, even in forky situations.

One more thing that would be interesting about attestations is if we could somehow batch aggregating them - i.e. right now, when an attestation arrives and has been verified, we look for an existing attestation to aggregate it with - if we could somehow batch verification and aggregation, that would be pretty nice.

arnetheduck avatar Mar 03 '21 20:03 arnetheduck

> @cheatfate's implementation of AsyncChannels is already based on a primitive called AsyncThreadEvent. It seems to me that this is just another name for the condition variables you have in mind.

Yes, AsyncThreadEvent can be used to build an AsyncConditionVariable or AsyncSemaphore. In particular, it would allow awaiting a Flowvar from an arbitrary threadpool.

I think it's worth splitting them out of the stalled AsyncChannel PR.

The main issue with AsyncThreadEvent is that there is no "2-phase commit" protocol to ensure usage without deadlocks:

  1. Event loop creates a new ATE
  2. Event loop sends work and the ATE to the threadpool
  3. Threadpool is fast, returns the work and fires the ATE
  4. Event loop waits on the ATE
  5. Event loop is "deadlocked" because the ATE was already fired

But the base primitive can indeed be used as a replacement for wait/signal on my futexes: https://github.com/mratsim/weave/blob/30baced/weave/cross_thread_com/event_notifiers.nim#L67-L148 so that we have something that has been verified not to deadlock.
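
To make the failure mode concrete, here is a minimal latching event in plain Nim (std/locks only; my illustration, not the ATE or weave implementation). Because fire() is remembered, the "fire before wait" ordering in steps 3-5 above no longer deadlocks:

```nim
import std/locks

type
  LatchingEvent = object
    lock: Lock
    cond: Cond
    fired: bool          # remembers a fire() that happened before any wait()

proc init(e: var LatchingEvent) =
  initLock e.lock
  initCond e.cond

proc fire(e: var LatchingEvent) =
  withLock e.lock:
    e.fired = true
    signal e.cond

proc wait(e: var LatchingEvent) =
  acquire e.lock
  while not e.fired:     # loop guards against spurious wakeups
    wait(e.cond, e.lock)
  e.fired = false        # auto-reset so the event can be reused
  release e.lock
```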

It's possible that the issues the desktop team is facing are in part due to that. It took me weeks, before resorting to formal verification, to write the protocol properly. And even published papers have made errors in the past: http://research.microsoft.com/en-us/um/people/lamport/tla/dcas.pdf

mratsim avatar Mar 04 '21 12:03 mratsim

Attestation batching is done in #2439.

mratsim avatar Mar 22 '21 10:03 mratsim