
Suggested compliance revisions

Open AlexHentschel opened this issue 2 years ago • 2 comments

This is a list of suggested revisions and improvements that resulted from reviewing PR #3294.

  • [ ] in the collector compliance.Core, method OnBlockProposal, we query the finalized block directly from the database (code)

    • this database interaction is relatively expensive
    • the compliance engine already subscribes to block finalization events (-> code) and could easily obtain the finalized height from those notifications

    Notifications are asynchronous. Hence, the view of compliance.Core would be eventually consistent. This is sufficient, because the height is only a heuristic cut-off for "too new blocks" (mitigating memory exhaustion attacks), where the actual limit can vary over time as long as there is a finite limit.
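To make the suggestion concrete, here is a minimal sketch of caching the finalized height from notifications instead of querying the database. All names (`finalizedHeightTracker`, `OnFinalizedBlock`, `FinalizedHeight`) are hypothetical stand-ins, not the actual flow-go API; the point is only the eventually consistent, lock-free read path.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// finalizedHeightTracker is a hypothetical sketch: instead of hitting the
// database in OnBlockProposal, the compliance core caches the latest
// finalized height it learned from asynchronous finalization notifications.
type finalizedHeightTracker struct {
	height atomic.Uint64
}

// OnFinalizedBlock would be wired to the block-finalization subscription.
func (t *finalizedHeightTracker) OnFinalizedBlock(height uint64) {
	// Notifications are asynchronous and may arrive out of order;
	// only ever move the cached height forward.
	for {
		cur := t.height.Load()
		if height <= cur || t.height.CompareAndSwap(cur, height) {
			return
		}
	}
}

// FinalizedHeight is an eventually consistent view, which suffices for a
// heuristic cut-off of "too new" proposals.
func (t *finalizedHeightTracker) FinalizedHeight() uint64 {
	return t.height.Load()
}

func main() {
	var t finalizedHeightTracker
	t.OnFinalizedBlock(10)
	t.OnFinalizedBlock(7) // stale notification, ignored
	fmt.Println(t.FinalizedHeight())
}
```

Because the cut-off only needs to be a finite, roughly current limit, the staleness introduced by the notification path is harmless.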

  • [ ] this condition https://github.com/onflow/flow-go/blob/b4a7dbb4a6b7cf09a481f7eaa49c5bd8057f2427/engine/collection/compliance/core.go#L137 can be rephrased. In its current form, we require the first condition

    header.Height > final.Height
    

    to prevent an underflow in the second condition

    header.Height-final.Height > c.config.SkipNewProposalsThreshold
    

    However, with a simple arithmetic transformation (only using addition, thereby avoiding any underflow) we can just write:

    header.Height > final.Height + c.config.SkipNewProposalsThreshold
    
    • I think it would be beneficial to do this check at the very beginning of the method. Thereby, we avoid expensive database queries for blocks very far into the future (Headers queries the database on a cache miss).
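As a sketch, the rephrased check could run as an early guard before any Headers lookup. The function and variable names below mirror the snippets above but are illustrative, not the actual core.go code:

```go
package main

import "fmt"

// tooFarAhead implements the rephrased condition using only addition, so no
// uint64 underflow is possible, and it is cheap enough to run before any
// database lookup. (Illustrative sketch, not the actual core.go code.)
func tooFarAhead(headerHeight, finalHeight, skipNewProposalsThreshold uint64) bool {
	return headerHeight > finalHeight+skipNewProposalsThreshold
}

func main() {
	// finalized height 1000, threshold 100_000
	fmt.Println(tooFarAhead(500, 1000, 100000))    // old block: false, and no underflow
	fmt.Println(tooFarAhead(101001, 1000, 100000)) // just past the limit: true
}
```

Note that the original form `headerHeight - finalHeight` would underflow for `headerHeight < finalHeight` without the guarding first condition, whereas the additive form cannot (block heights are nowhere near the uint64 overflow range).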
  • [ ] At the moment, the collector compliance core (-> code) processes old proposals below the finalization height. This can be exploited by byzantine replicas that just send us duplicates of very old blocks. For these blocks, we query Headers, which will fall back to a time-intensive database query for an old block. Furthermore, the old block will then be added to the Headers-internal cache, potentially evicting newer blocks. So a byzantine replica can cause many expensive database queries and eviction of needed data from the cache by replaying very old blocks.

    Ideally, blocks with views below the finalized view would be dropped right away in the compliance engine and not even queued. Also, a filter in Core.OnBlockProposal for the view would be great.
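A sketch of the cache-side half of this mitigation: refuse to admit entries at or below the finalized height, so replayed old blocks cannot evict newer, still-needed headers. The `headerCache` type and its plain-map backing are simplified stand-ins for the Headers-internal cache, not the flow-go implementation:

```go
package main

import "fmt"

// headerCache is a simplified stand-in for the Headers-internal cache.
// Entries at or below the finalized height are rejected outright, so
// byzantine replays of old blocks cannot pollute the cache.
type headerCache struct {
	finalized uint64
	entries   map[uint64]string // height -> header (simplified payload)
}

// Add admits the header only if it is above the finalized height and
// reports whether it was cached.
func (c *headerCache) Add(height uint64, header string) bool {
	if height <= c.finalized {
		return false // old block: do not spend cache space on it
	}
	c.entries[height] = header
	return true
}

func main() {
	c := &headerCache{finalized: 100, entries: map[uint64]string{}}
	fmt.Println(c.Add(42, "replayed-old-block")) // rejected
	fmt.Println(c.Add(101, "fresh-proposal"))    // cached
	fmt.Println(len(c.entries))
}
```

The complementary view-based filter in Core.OnBlockProposal would be the same comparison against the finalized view, applied before the proposal is queued at all.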

  • [ ] the compliance layer could add both valid and invalid blocks directly to the VoteAggregator. At the moment, this logic is a bit confusing, because the compliance layer informs the VoteAggregator about invalid blocks, while the EventHandler synchronously adds valid blocks to the VoteAggregator. Having the compliance layer add blocks (valid or invalid) would make the code more consistent and potentially more performant, as it removes work from the event loop.

  • [ ] can we address this edge case? There might be a non-trivial spamming opportunity here depending on the value of SkipNewProposalsThreshold (-> code)

AlexHentschel (Oct 12 '22)

> can we address this edge case? There might be a non-trivial spamming opportunity here depending on the value of SkipNewProposalsThreshold

Could you explain how this relates to a spamming opportunity?

jordanschalm (Oct 13 '22)

I am not sure whether that edge case really opens a spamming vulnerability. But here are my thoughts why I am worried:

  • Personally, I find Moxie Marlinspike's (generalized) Cryptographic Doom Principle a very useful guideline. This "generalization" was formulated by our former colleague Francois Garillot:

    If you spend any resources on a message before confirming the message's validity, it will somehow inevitably lead to doom.

    This is a very general principle, which hides some practically important limitations:

    • our messages (blocks) have a small upper bound on size, so an attacker needs to be able to send many adversarial messages
    • sending those messages needs to be of low cost for the attacker (if we can detect such an attack with reasonably high probability and slash the attacker on a reasonably short time scale, then it is not low cost for the attacker)
  • So let's check:

    • we spend networking resources receiving the proposal and then spend CPU cycles processing the proposal
    • an attacker can send many pending blocks
    • we can't slash the attacker if it crafts its blocks such that they fall into the edge case

    Therefore, I am worried, because the Cryptographic Doom Principle tells us that there might be a vulnerability here.

Let's take a closer look at what could happen:

  • the default value for SkipNewProposalsThreshold is 100k blocks, roughly 27 hours. I find it plausible that at times the new epoch might not be committed 27 hours before it starts.

  • An attacker can now send us thousands of blocks for the next epoch that we can't check

    • if the blocks are cached, the attacker floods our cache with garbage, effectively rendering it useless
    • if the blocks are processed and dropped during the edge case, we could spend many CPU cycles processing and discarding the garbage

    This can go on for a prolonged period of time, potentially causing hours of dramatically degraded service, which might already be damaging enough that an attacker would accept the penalty of eventually being slashed.

Overall, I don't expect that we can avoid all spamming possibilities entirely. But in cases where we can close the vulnerability with reasonable engineering effort, I think it might be worth it. In this particular case, I think we could simply not cache any blocks belonging to epochs for which we don't know the committee, which wouldn't close the spamming surface entirely, but would at least mitigate the cache-flooding attack.
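A minimal sketch of that mitigation, assuming the epoch boundary is known as a final view of the latest committed epoch. `epochKnown`, `maybeCache`, and their parameters are hypothetical names, not the flow-go protocol-state API:

```go
package main

import "fmt"

// epochKnown is a hypothetical check: a view belongs to a known epoch iff it
// does not exceed the final view of the latest committed epoch (for which we
// know the committee).
func epochKnown(view, committedEpochFinalView uint64) bool {
	return view <= committedEpochFinalView
}

// maybeCache refuses to spend cache space on blocks from unknown epochs,
// whose QCs we cannot validate yet. This mitigates cache flooding even
// though it does not close the spamming surface entirely.
func maybeCache(cache map[uint64]string, view uint64, block string, committedEpochFinalView uint64) bool {
	if !epochKnown(view, committedEpochFinalView) {
		return false
	}
	cache[view] = block
	return true
}

func main() {
	cache := map[uint64]string{}
	// latest committed epoch ends at view 200_000
	fmt.Println(maybeCache(cache, 250000, "next-epoch-garbage", 200000)) // rejected
	fmt.Println(maybeCache(cache, 150000, "known-epoch-block", 200000))  // cached
}
```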

AlexHentschel (Oct 13 '22)

> if the blocks are cached, the attacker floods our cache with garbage, effectively rendering it useless

Thank you for the explanation. Initially I thought this was no different than garbage blocks within a known epoch, but disconnected from the state. But since we can validate these blocks' QCs regardless of whether they connect to the finalized state (using the epoch-scoped committee), the unknown-epoch case is indeed distinct. Agree, we should check that the epoch is known before caching a block.

Currently, we cache any blocks which don't immediately connect to the finalized state. When we disconnect QC validation from protocol state validation (perform QC validation first, always), we can immediately discard blocks which aren't in a known epoch (will fit neatly into error handling for ValidateProposal).

jordanschalm (Oct 18 '22)