optimism
optimism copied to clipboard
op-node: Restore previous unsafe chain when invalid span batch
Description
Preliminary: Unsafe Head Reorg while Span Batch Derivation
Unsafe head reorg occurs when derivation result from batch differs from previously known unsafe chain. If every L2 attributes included in derived result from batch is valid, that will be the new unsafe/safe block.
If invalid L2 attribute is included in the derived results, safe head will not advance. On the other hand, unsafe head may advance to the middle of span batches. In short, we have below invariants:
- Safe head update is atomic: Only update when every L2 attributes from batch is valid.
- Unsafe head update is not atomic: Unsafe block can advance even if batch contains invalid L2 attributes.
I have written some examples to illustrate the behavior.
Example 1: Partially Invalid Span Batch
When deriving B2
safe which is a new block, engine queue calls fork choice update to update unsafe head to B2
. After that, B3
is a invalid block, so leftover attributes is dropped. Final chain state will be the form:
As we can see, we lost A2
, A3
, A4
and A5
.
Example 2: Totally Invalid Span Batch / Totally Invalid Singular Batch
When deriving B1
safe which is a new block, engine queue immediately drops entire batches. In this case, unsafe head is never altered, staying atA5
. Final state will be the form:
Upper final state will be achieved when digesting invalid singular batch. Every singular batch derivation results in single attribute, so there is no partial invalid state. Example case:
Issue: Losing Unsafe Blocks which May be Eventually Canonical Hurts!
During happy path sync, distance between unsafe head and safe head may be drastically larger. In this case, if partially invalid span batch(like Example 1) is consumed by rollup node, we will lose all synced unsafe blocks.
Lets focus more at Example 1. We lost A2
, A3
, A4
and A5
because of invalid span batch. The digested span batch is invalid, so A2
, A3
, A4
and A5
is more likely to become a canonical safe chain, compared to B2
.
Proposed Solution: Restore Previous unsafe head when Invalid Span Batch
I introduce new field backupUnsafeHead
and needFCUCallForBackupUnsafeReorg
to EngineController
(which is embedded in engine queue).
If current known unsafe head's block number is larger or equal to valid attributes which is valid but results in different L2 attributes, initialize backupUnsafeHead
to current known unsafe head. It will be a backup which will be used when invalid L2 block is detected. When invalid L2 block is detected, needFCUCallForBackupUnsafeReorg
will be set to true.
Added TryBackupUnsafeReorg
method which is similar to InsertUnsafePayload
method. The method tries to call fork choice update(lets abbreviate this to FCU) without payloadAttributes
to restore unsafeHead to backupUnsafe
. FCU can return these responses: Spec.
- If err is returned and it is not an
InputError
, it may be an error(network error/server fault) and retry. - If err is returned and it is an
InputError
, do not retry and forget aboutbackupUnsafe
. - If err is not returned, but payloadStatus is not
VALID
, forget aboutbackupUnsafe
.
Similar to https://github.com/ethereum-optimism/optimism/blob/develop/specs/derivation.md#l1-sync-payload-attributes-processing, but there is no reset.
If FCU returns VALID
payloadStatus, this means execution engine successfully restored(reorged) unsafe head using backupUnsafe
. Only update rollup node's state when FCU returns VALID
. In other cases, forget about backupUnsafe
.
When TryBackupUnsafeReorg
performs a network call, it fully consumes single step of engine queue. This design pattern is similar with method TryUpdateEngine
.
One thing that I would like to discuss is that, if execution engine did not return VALID
payloadStatus, is it okay to simply progress engine queue? Is there any things to do, like cleanup? Maybe calling FCU with original unsafe head(not backupUnsafe) will be needed.
Tests
Added three e2e tests.
TestBackupUnsafe
Same situation introduced at Example 1. Unsafe head will be restored back to A5
, instead of B2
.
TestBackupUnsafeReorgForkChoiceInputError
Same situation introduced at Example 1. However, execution engine is mocked to return InputError
when FCU is called. In this case, there will be no retry and no restoration will be done.
TestBackupUnsafeReorgForkChoiceNotInputError
Same situation introduced at Example 1. However, execution engine is mocked to return an error which is not InputError
when FCU is called. In this case, there will be retry and eventually restoration will be done, reorging back unsafe head to A5
.
Additional context
This is not a consensus change, but just need a node implementation update.
Partially invalid batches will be posted very unlikely. It is very hard to make a partially invalid batch unless there is a derivation/batching bug.
Walkthrough
Walkthrough
This update enhances error handling in L2 RPC failure simulation, introduces a new method for unsafe L2 block backup, and improves reorganization logic in the engine. It enables specifying errors in L2 RPC failure simulations, adds backup functionality for unsafe L2 blocks, and integrates new logic for managing backup states and reorganizations within the engine controller and queue, enhancing system resilience and debugging capabilities.
Changes
Files | Change Summary |
---|---|
.../actions/l2_engine.go .../actions/l2_engine_test.go |
Updated ActL2RPCFail to accept an error parameter and modified tests to check for specific errors. |
.../actions/l2_verifier.go |
Added L2BackupUnsafe method for unsafe L2 block backup. |
.../actions/sync_test.go |
Introduced new tests for unsafe block handling and backup restoration. |
.../rollup/derive/engine_controller.go .../rollup/derive/engine_queue.go |
Added fields and methods for managing backup unsafe heads and reorg logic, including tracking, setting, and attempting backup unsafe reorgs. Integrated these functionalities into the engine's state management and queue logic. |
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?
Tips
Chat
There are 3 ways to chat with CodeRabbit:
- Review comments: Directly reply to a review comment made by CodeRabbit. Example:
-
I pushed a fix in commit <commit_id>.
-
Generate unit-tests for this file.
-
- Files and specific lines of code (under the "Files changed" tab): Tag
@coderabbitai
in a new review comment at the desired location with your query. Examples:-
@coderabbitai generate unit tests for this file.
-
@coderabbitai modularize this function.
-
- PR comments: Tag
@coderabbitai
in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:-
@coderabbitai generate interesting stats about this repository from git and render them as a table.
-
@coderabbitai show all the console.log statements in this repository.
-
@coderabbitai read src/utils.ts and generate unit tests.
-
@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
-
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.
CodeRabbit Commands (invoked as PR comments)
-
@coderabbitai pause
to pause the reviews on a PR. -
@coderabbitai resume
to resume the paused reviews. -
@coderabbitai review
to trigger a review. This is useful when automatic reviews are disabled for the repository. -
@coderabbitai resolve
resolve all the CodeRabbit review comments. -
@coderabbitai help
to get help.
Additionally, you can add @coderabbitai ignore
anywhere in the PR description to prevent this PR from being reviewed.
CodeRabbit Configration File (.coderabbit.yaml
)
- You can programmatically configure CodeRabbit by adding a
.coderabbit.yaml
file to the root of your repository. - The JSON schema for the configuration file is available here.
- If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation:
# yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json
CodeRabbit Discord Community
Join our Discord Community to get help, request features, and share feedback.
These images and some of the text would be great to have added to the specs
I think that the flow is as follows:
- We force a span-batch-middle for the first time. This is where we should save the old unsafe head
- We potentially insert several more span batches
- The span batch fails. Then we should re-org back to the old unsafe head.
A couple pointers as this code is rapidly changing. We should put as much engine manipulation inside engine controller as possible and have a small set of external methods. It should limit the amount of internal state it is exposing.
I'm going to be putting up a couple more PRs to keep cleaning it up, but if we need to change the interface between the engine queue & the engine controller, we should.
For step 2, we potentially insert several more L2 attributes from a single span batch, not several more span batches. Other than that, the flow is right.
I have noticed that other engine queue related changes are in progress, like https://github.com/ethereum-optimism/optimism/pull/8966 and https://github.com/ethereum-optimism/optimism/pull/8968. Will rebase when these are merged. I agree that we should limit internal state exposure of engine controller.
These images and some of the text would be great to have added to the specs
Current L1-sync: payload attributes processing specs does not mention about span batches. They are intended to be working as span-batch agnostic.
The diagram and the text explains the current engine queue implementation about L1-sync, but not the spec. So this assets may be added to the tech docs, not the specs. What do you think? Also, are there any public tech docs which explains current implementations?
@pcw109550 more merge conflicts, but I think they're getting close to the end
@trianglesphere rebased since engine queue altering changes: https://github.com/ethereum-optimism/optimism/pull/8966, https://github.com/ethereum-optimism/optimism/pull/8968 are all merged. PTAL
This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.
@trianglesphere @axelKingsley @tynes May I please ask for review?