perf(derive): Cache batch validity
Overview
Introduces an optimization to the `CheckBatch` function in the derivation pipeline that caches the validity result for a `BatchWithL1InclusionBlock`. Currently, the `BatchQueue` performs this validation twice for the same batch, which bloats cycles in the fault proof program.
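For context, a minimal sketch of the double evaluation this PR targets (illustrative types and simplified signatures, not the actual op-node code): the batch is validated once when it is added to the queue, and again when the queue tries to derive the next batch.

```go
package derive

// Illustrative stand-ins for the real op-node types.
type L1BlockRef struct{ Number uint64 }
type L2BlockRef struct{ Number uint64 }
type BatchWithL1InclusionBlock struct{ EpochNum uint64 }

type BatchValidity int

const (
	BatchDrop BatchValidity = iota
	BatchAccept
	BatchUndecided
	BatchFuture
)

// checkBatch stands in for the (expensive) CheckBatch validation.
func checkBatch(l1Blocks []L1BlockRef, l2SafeHead L2BlockRef, b *BatchWithL1InclusionBlock) BatchValidity {
	// ...timestamp / epoch / origin validation elided...
	return BatchAccept
}

type BatchQueue struct {
	l1Blocks []L1BlockRef
	batches  []*BatchWithL1InclusionBlock
}

// First evaluation: a batch is validated when it is added to the queue.
func (bq *BatchQueue) AddBatch(b *BatchWithL1InclusionBlock, l2SafeHead L2BlockRef) {
	if checkBatch(bq.l1Blocks, l2SafeHead, b) == BatchDrop {
		return // invalid batches are dropped on ingestion
	}
	bq.batches = append(bq.batches, b)
}

// Second evaluation: the same batches are validated again when deriving
// the next batch, repeating the work already done in AddBatch.
func (bq *BatchQueue) deriveNextBatch(l2SafeHead L2BlockRef) *BatchWithL1InclusionBlock {
	for _, b := range bq.batches {
		if checkBatch(bq.l1Blocks, l2SafeHead, b) == BatchAccept {
			return b
		}
	}
	return nil
}
```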
Safety
Once a batch has been deemed `Accepted`, it should be processed; this status effectively marks the batch as "sealed." As such, if `CheckBatch` says a batch is valid, we should cache the result for any future calls. Other statuses, i.e. `BatchUndecided` and `BatchFuture`, should not be cached, as they are liable to change.
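As a rough sketch of that policy (reusing the hypothetical types from the sketch above and assuming an added `validityCache map[*BatchWithL1InclusionBlock]BatchValidity` field on `BatchQueue`; not the actual implementation):

```go
// Only a sealed (Accept) result is ever reused. Undecided / Future results
// are liable to change as more L1 data arrives, so they are re-evaluated on
// every call and any stale cache entry is cleared.
func (bq *BatchQueue) checkBatchCached(l1Blocks []L1BlockRef, l2SafeHead L2BlockRef, b *BatchWithL1InclusionBlock) BatchValidity {
	if v, ok := bq.validityCache[b]; ok && v == BatchAccept {
		return BatchAccept // "sealed": once accepted, always accepted
	}
	v := checkBatch(l1Blocks, l2SafeHead, b)
	if v == BatchAccept {
		bq.validityCache[b] = v // cache only the Accept outcome
	} else {
		delete(bq.validityCache, b) // anything else may change later
	}
	return v
}
```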
Will review when back online. But from a quick glance, I'm not sure the cache assumption of "if accepted, always accepted" holds true. E.g. the `l1Blocks` input can change, which may affect the output in ways we need to look at more closely.
Thanks! I'm pretty sure this is safe, but would definitely appreciate a sanity check.
Logic I'm following is:
- Once a batch has been deemed as `Accepted` within `NextBatch`'s call to `AddBatch`, the batch's validity cannot change between the `AddBatch` validity check and `deriveNextBatch` afterwards.
  - Note that this branch may never be hit if `AddBatch` was called in `NextBatch`.
- Since `AddBatch` appends the new batch at the end of the queue, it is possible for `l1Blocks` to change as the epoch is advanced by `deriveNextBatch`, until the batch is popped off the head of the queue in `popNextBatch`.
  - However, on the first pass, the batch should not have been deemed as `Accepted` if the `l1Blocks` array did not contain the next epoch as well as the batch origin in the first 2 elements. This would result in an `Undecided` or `Drop` state, so the cache would be invalidated on the next call to `CheckBatch`. This can be seen here & here (sketched after this comment).
- Critically, if the batch's cached validity is anything other than `Accept`, we always invalidate the cache and allow a re-check. This keeps the `Undecided` and `Future` states functioning as intended.
May be missing something, but was unable to find a conflicting outcome in both the Rust and Golang implementation.
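To illustrate the second bullet, here is a rough sketch (again with the hypothetical types from the earlier sketch; the real `CheckBatch` logic is more involved) of why an incomplete `l1Blocks` window cannot have produced a cached `Accept` in the first place:

```go
// If the l1Blocks window does not yet contain both the batch's origin and the
// next epoch, the check cannot return Accept, so nothing is cached and the
// batch is fully re-validated once the window has advanced.
func checkBatchWindowSketch(l1Blocks []L1BlockRef, b *BatchWithL1InclusionBlock) BatchValidity {
	if len(l1Blocks) == 0 {
		return BatchUndecided // current epoch not known yet
	}
	epoch := l1Blocks[0]
	if b.EpochNum == epoch.Number+1 && len(l1Blocks) < 2 {
		// The next epoch's origin is not available yet: Undecided, never cached.
		return BatchUndecided
	}
	// ...remaining validation elided; only an Accept result would be cached...
	return BatchAccept
}
```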
Could the L2 head also have changed between when a batch is added to the queue and when it actually gets processed? The use of a queue looks pretty weird: we seem to try to add a batch every time we call `NextBatch` and then immediately process it with `deriveNextBatch`, but given there is a queue, it should be possible for batches to be left in the queue between calls to `NextBatch`, and those later calls might then have a different parent `eth.L2BlockRef`. I guess that would only happen if the batch was `BatchFuture` or `BatchUndecided`, which we don't cache, but it's not an obvious link in the code.
Also, the cache-add code generally looks weird to me, since it looks like it will drop the batch it read from `prev.NextBatch` if `!originBehind`: https://github.com/ethereum-optimism/optimism/blob/01d3a17195cc5bdc954263d0dc7a9ade3d5d8eba/op-node/rollup/derive/batch_queue.go#L149-L157
My gut feel here is that even if this is safe, we're adding another complex, somewhat implicit invariant to code that is already hard to reason about, which is likely to cause us bugs in the future. It is a nice improvement, so I wouldn't block it if we decide it's safe, but if anyone has ideas on how to refactor the code so the validation isn't done twice in the first place, that would be ideal. The best I can think of is to make the caching a bit more specific and only use the cached value if the first `l1Blocks` entry and the L2 parent are the same as when we last checked. Still doesn't seem ideal (and I'm slightly nervous about the way span batch validation fetches additional L2 blocks as part of its verification).
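For illustration, one possible shape of that more specific caching (hypothetical names, reusing the types from the earlier sketch and assuming a `keyedCache map[*BatchWithL1InclusionBlock]cachedValidity` field on `BatchQueue`; note it still wouldn't capture the extra L2 blocks fetched during span batch validation):

```go
// The cached result is keyed on the inputs it was computed against and is
// only reused if the first l1Blocks entry and the L2 parent are unchanged.
type cachedValidity struct {
	validity BatchValidity
	l1Origin L1BlockRef // l1Blocks[0] at the time of the check
	l2Parent L2BlockRef // L2 safe head at the time of the check
}

func (bq *BatchQueue) checkBatchKeyed(l1Blocks []L1BlockRef, l2SafeHead L2BlockRef, b *BatchWithL1InclusionBlock) BatchValidity {
	if c, ok := bq.keyedCache[b]; ok && len(l1Blocks) > 0 &&
		c.l1Origin == l1Blocks[0] && c.l2Parent == l2SafeHead {
		return c.validity // same inputs, result still applies
	}
	v := checkBatch(l1Blocks, l2SafeHead, b)
	if len(l1Blocks) > 0 {
		bq.keyedCache[b] = cachedValidity{validity: v, l1Origin: l1Blocks[0], l2Parent: l2SafeHead}
	}
	return v
}
```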
This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.