optimism op-node: Restore previous unsafe chain when invalid span batch

Description

Preliminary: Unsafe Head Reorg while Span Batch Derivation

Unsafe head reorg occurs when derivation result from batch differs from previously known unsafe chain. If every L2 attributes included in derived result from batch is valid, that will be the new unsafe/safe block.

If invalid L2 attribute is included in the derived results, safe head will not advance. On the other hand, unsafe head may advance to the middle of span batches. In short, we have below invariants:

Safe head update is atomic: Only update when every L2 attributes from batch is valid.
Unsafe head update is not atomic: Unsafe block can advance even if batch contains invalid L2 attributes.

I have written some examples to illustrate the behavior.

Example 1: Partially Invalid Span Batch

When deriving B2 safe which is a new block, engine queue calls fork choice update to update unsafe head to B2. After that, B3 is a invalid block, so leftover attributes is dropped. Final chain state will be the form:

As we can see, we lost A2, A3, A4 and A5.

Example 2: Totally Invalid Span Batch / Totally Invalid Singular Batch

When deriving B1 safe which is a new block, engine queue immediately drops entire batches. In this case, unsafe head is never altered, staying atA5. Final state will be the form:

Upper final state will be achieved when digesting invalid singular batch. Every singular batch derivation results in single attribute, so there is no partial invalid state. Example case:

Issue: Losing Unsafe Blocks which May be Eventually Canonical Hurts!

During happy path sync, distance between unsafe head and safe head may be drastically larger. In this case, if partially invalid span batch(like Example 1) is consumed by rollup node, we will lose all synced unsafe blocks.

Lets focus more at Example 1. We lost A2, A3, A4 and A5 because of invalid span batch. The digested span batch is invalid, so A2, A3, A4 and A5 is more likely to become a canonical safe chain, compared to B2.

Proposed Solution: Restore Previous unsafe head when Invalid Span Batch

I introduce new field backupUnsafeHead and needFCUCallForBackupUnsafeReorg to EngineController(which is embedded in engine queue).

If current known unsafe head's block number is larger or equal to valid attributes which is valid but results in different L2 attributes, initialize backupUnsafeHead to current known unsafe head. It will be a backup which will be used when invalid L2 block is detected. When invalid L2 block is detected, needFCUCallForBackupUnsafeReorg will be set to true.

Added TryBackupUnsafeReorg method which is similar to InsertUnsafePayload method. The method tries to call fork choice update(lets abbreviate this to FCU) without payloadAttributes to restore unsafeHead to backupUnsafe. FCU can return these responses: Spec.

If err is returned and it is not an InputError, it may be an error(network error/server fault) and retry.
If err is returned and it is an InputError, do not retry and forget about backupUnsafe.
If err is not returned, but payloadStatus is not VALID, forget about backupUnsafe.

Similar to https://github.com/ethereum-optimism/optimism/blob/develop/specs/derivation.md#l1-sync-payload-attributes-processing, but there is no reset.

If FCU returns VALID payloadStatus, this means execution engine successfully restored(reorged) unsafe head using backupUnsafe. Only update rollup node's state when FCU returns VALID. In other cases, forget about backupUnsafe.

When TryBackupUnsafeReorg performs a network call, it fully consumes single step of engine queue. This design pattern is similar with method TryUpdateEngine.

One thing that I would like to discuss is that, if execution engine did not return VALID payloadStatus, is it okay to simply progress engine queue? Is there any things to do, like cleanup? Maybe calling FCU with original unsafe head(not backupUnsafe) will be needed.

Tests

Added three e2e tests.

`TestBackupUnsafe`

Same situation introduced at Example 1. Unsafe head will be restored back to A5, instead of B2.

`TestBackupUnsafeReorgForkChoiceInputError`

Same situation introduced at Example 1. However, execution engine is mocked to return InputError when FCU is called. In this case, there will be no retry and no restoration will be done.

`TestBackupUnsafeReorgForkChoiceNotInputError`

Same situation introduced at Example 1. However, execution engine is mocked to return an error which is not InputError when FCU is called. In this case, there will be retry and eventually restoration will be done, reorging back unsafe head to A5.

Additional context

This is not a consensus change, but just need a node implementation update.

Partially invalid batches will be posted very unlikely. It is very hard to make a partially invalid batch unless there is a derivation/batching bug.

Jan 10 '24 08:01 pcw109550

Walkthrough

This update enhances error handling in L2 RPC failure simulation, introduces a new method for unsafe L2 block backup, and improves reorganization logic in the engine. It enables specifying errors in L2 RPC failure simulations, adds backup functionality for unsafe L2 blocks, and integrates new logic for managing backup states and reorganizations within the engine controller and queue, enhancing system resilience and debugging capabilities.

Changes

Files	Change Summary
`.../actions/l2_engine.go` `.../actions/l2_engine_test.go`	Updated `ActL2RPCFail` to accept an error parameter and modified tests to check for specific errors.
`.../actions/l2_verifier.go`	Added `L2BackupUnsafe` method for unsafe L2 block backup.
`.../actions/sync_test.go`	Introduced new tests for unsafe block handling and backup restoration.
`.../rollup/derive/engine_controller.go` `.../rollup/derive/engine_queue.go`	Added fields and methods for managing backup unsafe heads and reorg logic, including tracking, setting, and attempting backup unsafe reorgs. Integrated these functionalities into the engine's state management and queue logic.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>.
- Generate unit-tests for this file.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit tests for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai generate interesting stats about this repository from git and render them as a table.
- @coderabbitai show all the console.log statements in this repository.
- @coderabbitai read src/utils.ts and generate unit tests.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
The JSON schema for the configuration file is available here.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

CodeRabbit Discord Community

Join our Discord Community to get help, request features, and share feedback.

Jan 10 '24 09:01 coderabbitai[bot]

These images and some of the text would be great to have added to the specs

Jan 10 '24 14:01 tynes

I think that the flow is as follows:

We force a span-batch-middle for the first time. This is where we should save the old unsafe head

We potentially insert several more span batches

The span batch fails. Then we should re-org back to the old unsafe head.

A couple pointers as this code is rapidly changing. We should put as much engine manipulation inside engine controller as possible and have a small set of external methods. It should limit the amount of internal state it is exposing.

I'm going to be putting up a couple more PRs to keep cleaning it up, but if we need to change the interface between the engine queue & the engine controller, we should.

For step 2, we potentially insert several more L2 attributes from a single span batch, not several more span batches. Other than that, the flow is right.

I have noticed that other engine queue related changes are in progress, like https://github.com/ethereum-optimism/optimism/pull/8966 and https://github.com/ethereum-optimism/optimism/pull/8968. Will rebase when these are merged. I agree that we should limit internal state exposure of engine controller.

Jan 15 '24 07:01 pcw109550

These images and some of the text would be great to have added to the specs

Current L1-sync: payload attributes processing specs does not mention about span batches. They are intended to be working as span-batch agnostic.

The diagram and the text explains the current engine queue implementation about L1-sync, but not the spec. So this assets may be added to the tech docs, not the specs. What do you think? Also, are there any public tech docs which explains current implementations?

Jan 15 '24 07:01 pcw109550

@pcw109550 more merge conflicts, but I think they're getting close to the end

Jan 19 '24 00:01 trianglesphere

@trianglesphere rebased since engine queue altering changes: https://github.com/ethereum-optimism/optimism/pull/8966, https://github.com/ethereum-optimism/optimism/pull/8968 are all merged. PTAL

Jan 21 '24 16:01 pcw109550

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Feb 14 '24 01:02 github-actions[bot]

@trianglesphere @axelKingsley @tynes May I please ask for review?

Feb 14 '24 06:02 pcw109550

op-node: Restore previous unsafe chain when invalid span batch

Preliminary: Unsafe Head Reorg while Span Batch Derivation

Example 1: Partially Invalid Span Batch

Example 2: Totally Invalid Span Batch / Totally Invalid Singular Batch

Issue: Losing Unsafe Blocks which May be Eventually Canonical Hurts!

Proposed Solution: Restore Previous unsafe head when Invalid Span Batch

TestBackupUnsafe

TestBackupUnsafeReorgForkChoiceInputError

TestBackupUnsafeReorgForkChoiceNotInputError

Walkthrough

Changes

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configration File (.coderabbit.yaml)

CodeRabbit Discord Community

`TestBackupUnsafe`

`TestBackupUnsafeReorgForkChoiceInputError`

`TestBackupUnsafeReorgForkChoiceNotInputError`

CodeRabbit Configration File (`.coderabbit.yaml`)