
perf(derive): Cache batch validity

Open · clabby opened this pull request 1 year ago • 3 comments

Overview

Introduces an optimization to the CheckBatch function in the derivation pipeline, which caches the validity result for the BatchWithL1InclusionBlock. Currently, the BatchQueue performs this work twice for the same batch, which wastes cycles in the fault proof program.
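Roughly, the redundancy looks like the following. This is a simplified sketch with hypothetical, trimmed-down types and signatures for illustration, not the real op-node API:

```go
// Simplified sketch of the current (pre-PR) behavior. The types and
// signatures are hypothetical stand-ins trimmed down from op-node's
// batch_queue.go, not the real code.
package derive

type BatchValidity int

const (
	BatchDrop BatchValidity = iota
	BatchFuture
	BatchUndecided
	BatchAccept
)

// BatchWithL1InclusionBlock stands in for the real struct of the same name.
type BatchWithL1InclusionBlock struct {
	// batch payload and L1 inclusion block elided
}

// checkBatch stands in for CheckBatch: the expensive validity check this PR
// wants to avoid running twice per batch.
func checkBatch(b *BatchWithL1InclusionBlock) BatchValidity {
	// origin / timestamp / parent-hash checks, span-batch work, etc.
	return BatchUndecided
}

type BatchQueue struct {
	batches []*BatchWithL1InclusionBlock
}

// AddBatch validates the incoming batch once before enqueueing it.
func (bq *BatchQueue) AddBatch(b *BatchWithL1InclusionBlock) {
	if checkBatch(b) == BatchDrop { // first CheckBatch call
		return
	}
	bq.batches = append(bq.batches, b)
}

// deriveNextBatch re-validates queued batches before using them.
func (bq *BatchQueue) deriveNextBatch() *BatchWithL1InclusionBlock {
	for _, b := range bq.batches {
		if checkBatch(b) == BatchAccept { // second CheckBatch call on the same batch
			return b
		}
	}
	return nil
}
```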

Safety

Once a batch has been deemed Accepted, it should be processed. This status effectively marks the batch as "sealed." As such, if CheckBatch says a batch is valid, we should cache the result for any future calls. Other statuses, i.e. BatchUndecided and BatchFuture, should not be cached, as they are liable to change.
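A minimal sketch of that rule, continuing the simplified types above (the cachedValidity field and wrapper are hypothetical, not the PR's actual implementation):

```go
// checkedBatch pairs a batch with its cached validity. The cachedValidity
// field is a hypothetical illustration of the caching rule described above.
type checkedBatch struct {
	batch          *BatchWithL1InclusionBlock
	cachedValidity *BatchValidity // nil = nothing cached
}

// checkBatchCached returns a previously cached Accept without re-running the
// check; every other outcome is recomputed, since Undecided/Future results
// are liable to change as new L1/L2 data arrives.
func (cb *checkedBatch) checkBatchCached() BatchValidity {
	if cb.cachedValidity != nil && *cb.cachedValidity == BatchAccept {
		return BatchAccept // "sealed": an accepted batch stays accepted
	}
	v := checkBatch(cb.batch)
	if v == BatchAccept {
		cb.cachedValidity = &v // cache only the Accept result
	} else {
		cb.cachedValidity = nil // anything else must be re-checked next time
	}
	return v
}
```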

clabby avatar Jun 29 '24 20:06 clabby

  • #11055 Graphite 👈
  • develop

This stack of pull requests is managed by Graphite. Learn more about stacking.


clabby avatar Jun 29 '24 20:06 clabby

Will review when back online. But from a quick glance I am not sure the cache's "if accepted, always accepted" assumption holds true. E.g. the l1Blocks input can change, and may affect the output in ways that we need to look at more closely.

Thanks! I'm pretty sure this is safe, but would definitely appreciate a sanity check.

The logic I'm following is:

  • Once a batch has been deemed Accepted within NextBatch's call to AddBatch, its validity cannot change between the AddBatch validity check and the deriveNextBatch call that follows
    • Note that this branch may never be hit if AddBatch was called in NextBatch.
  • Since AddBatch appends the new batch at the end of the queue, it is possible for l1Blocks to change as the epoch is advanced by deriveNextBatch, before the batch is popped off the head of the queue in popNextBatch
    • However, on the first pass, the batch would not have been deemed Accepted unless the l1Blocks array contained both the batch origin and the next epoch in its first two elements. Anything else would result in an Undecided or Drop state, so the cache would be invalidated on the next call to CheckBatch. This can be seen here & here.
  • Critically, if the batch's cached validity is anything other than Accept, we always invalidate the cache and allow a re-check. This lets the Undecided and Future states continue functioning as intended.

I may be missing something, but I was unable to find a conflicting outcome in either the Rust or the Golang implementation.
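For orientation, the ordering being reasoned about above is roughly the following. This is a stripped-down sketch reusing the hypothetical types from the earlier sketches, not the real NextBatch signature:

```go
// Simplified control flow of NextBatch; error handling, origin tracking and
// the real signatures are elided.
func (bq *BatchQueue) NextBatch(incoming *BatchWithL1InclusionBlock) *BatchWithL1InclusionBlock {
	// 1. A freshly read batch is validated and appended to the tail of the
	//    queue by AddBatch.
	if incoming != nil {
		bq.AddBatch(incoming)
	}
	// 2. deriveNextBatch then re-checks queued batches; the l1Blocks window
	//    may advance between these passes as epochs are processed.
	// 3. A batch only leaves the head of the queue once it checks out as
	//    Accept (popNextBatch in the real code).
	return bq.deriveNextBatch()
}
```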

clabby avatar Jun 30 '24 16:06 clabby

Could the l2 head also have changed between when a batch is added to the queue and when it actually gets processed? The use of a queue looks pretty weird: we seem to try to add a batch every time we call NextBatch and then immediately process it with deriveNextBatch. But given there is a queue, it seems possible for things to be left in it between different calls to NextBatch, and those calls might then have a different parent eth.L2BlockRef. I guess that would only happen if the batch was BatchFuture or BatchUndecided, which we don't cache, but it's not an obvious link in the code.

Also, the cache-add code generally looks weird to me, since it looks like it will drop the batch it read from prev.NextBatch if !originBehind: https://github.com/ethereum-optimism/optimism/blob/01d3a17195cc5bdc954263d0dc7a9ade3d5d8eba/op-node/rollup/derive/batch_queue.go#L149-L157

My gut feel here is that even if this is safe, we're adding another complex, somewhat implicit invariant into code that is already hard to reason about. That's likely going to cause us bugs in the future. It is a nice improvement, so I wouldn't block it if we decide it's safe, but if anyone has ideas on how we could refactor the code so it doesn't do the validation twice in the first place, that would be ideal to me. The best I can think of is to make the caching a bit more specific and use the cached value only if the first l1Block and the l2 parent are the same as when we last checked. That still doesn't seem ideal (and I'm slightly nervous about the way span batch validation fetches additional L2 blocks as part of its verification).
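A rough sketch of that narrower caching, reusing the hypothetical BatchValidity from the earlier sketches (the Hash type and cache shape are assumptions, not op-node's actual eth.L1BlockRef / eth.L2BlockRef structs):

```go
// Hash is a hypothetical stand-in for the hash fields on the L1/L2 block refs.
type Hash [32]byte

// validityCacheKey pins a cached result to the inputs it was computed
// against: the first L1 block of the window and the L2 parent.
type validityCacheKey struct {
	firstL1Block Hash
	l2Parent     Hash
}

type keyedValidityCache struct {
	key   validityCacheKey
	value BatchValidity
	ok    bool
}

// Lookup returns the cached result only if the L1 window start and L2 parent
// are unchanged since the result was stored.
func (c *keyedValidityCache) Lookup(firstL1Block, l2Parent Hash) (BatchValidity, bool) {
	if c.ok && c.key == (validityCacheKey{firstL1Block, l2Parent}) {
		return c.value, true
	}
	return BatchUndecided, false
}

// Store records a CheckBatch result together with the inputs it depends on.
func (c *keyedValidityCache) Store(firstL1Block, l2Parent Hash, v BatchValidity) {
	c.key = validityCacheKey{firstL1Block, l2Parent}
	c.value = v
	c.ok = true
}
```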

ajsutton avatar Jul 07 '24 21:07 ajsutton
