[BUG] op-node parse L1 batch data error: `frameLength is too large: 2838834252`
Describe the bug
op-node encounters an unrecoverable error when parsing batch data submitted by the batcher: `frameLength is too large: 2838834252`. This leaves op-node unable to reconcile with L1.
To Reproduce
Steps to reproduce the behavior:
- Stop the batcher for several hours
- Start the batcher
- op-node prints this error log:
Feb 22, 2023 @ 11:32:26.580 WARN [02-22|03:32:26.580] Failed to parse frames origin=f4c844..525ff1:1589279 err="frameLength is too large: 2838834252"
Expected behavior
op-node should keep reconciling with L1.
System Specs:
- OS: Ubuntu
- Package Version (or commit hash): 173181bd7f45aae9225523071defd5cf202fd812
Additional context
This error looks wrong to me: the frameLength should never be that large. I think it hit a decoding error, and read the wrong data as frame-length.
In the verifier this code was recently changed (hence the different error message formatting) due to a finding from the Sherlock bug-bounty contest (https://github.com/ethereum-optimism/optimism/pull/4867), which may be related.
But the error path is still the same:
- The frame is too large: https://github.com/ethereum-optimism/optimism/blob/72f24f11d21c167108082fdbe5bdffb3293348ed/op-node/rollup/derive/frame.go#L90
- The frame decoding errors: https://github.com/ethereum-optimism/optimism/blob/72f24f11d21c167108082fdbe5bdffb3293348ed/op-node/rollup/derive/frame.go#L140
- The bad frame is ignored: https://github.com/ethereum-optimism/optimism/blob/9a198e90b1a7ab90a2ca85a3277576103eba6548/op-node/rollup/derive/frame_queue.go#L45
This verifier code-path here functions as expected: bad frames are dropped and skipped over.
The batcher however needs to ensure it only outputs valid frames, or the frames will not be accepted.
The maximum frame size is configurable: https://github.com/ethereum-optimism/optimism/blob/e03b50731e0718a311f4f59e0699c90e6a598759/op-batcher/batcher/channel_builder.go#L26
And based on the max L1 tx size by default: https://github.com/ethereum-optimism/optimism/blob/3468c6af6df8132c848f6f329a0a5592e6a14800/op-batcher/batcher/driver.go#L92
Which is smaller than the enforced frame size: https://github.com/ethereum-optimism/optimism/blob/72f24f11d21c167108082fdbe5bdffb3293348ed/op-node/rollup/derive/frame.go#L14
So assuming the settings are left to defaults, then this may indicate some bug (that may be related or duplicate of the Sherlock contest finding) in the frame-encoding when buffering large amounts of data that corrupts the output (resulting in the ridiculous 2.8+ GB frame-size number).
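For context, a minimal sketch of the failing check (a paraphrase of the linked frame.go logic, not a verbatim copy; the 1,000,000-byte `MaxFrameLen` is my reading of the constant at that commit):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// MaxFrameLen approximates the frame-size cap enforced in
// op-node/rollup/derive/frame.go.
const MaxFrameLen = 1_000_000

// readFrameLen mimics the failing step: decode a big-endian uint32 length
// and reject anything above the cap.
func readFrameLen(r *bytes.Reader) (uint32, error) {
	var frameLength uint32
	if err := binary.Read(r, binary.BigEndian, &frameLength); err != nil {
		return 0, err
	}
	if frameLength > MaxFrameLen {
		return 0, fmt.Errorf("frameLength is too large: %d", frameLength)
	}
	return frameLength, nil
}

func main() {
	// The four bytes that decode to the value from the bug report.
	_, err := readFrameLen(bytes.NewReader([]byte{0xA9, 0x35, 0x2C, 0x4C}))
	fmt.Println(err) // frameLength is too large: 2838834252
}
```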
cc @sebastianst flagging possible op-batcher bug.
Thanks @protolambda for the initial analysis!
The frameLength of 2838834252 looks suspicious to me because it has the highest uint32 bit set: I think it never was a frameLength to begin with but instead is just random data that got decoded into a frame length. The unmarshaling code doesn't do any sanity check before this first check on the max. frame length. I bet the decoded frame number and channel id before are also random.
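To make the high-bit observation concrete (plain arithmetic, independent of the repo):

```go
package main

import "fmt"

func main() {
	const frameLength uint32 = 2838834252 // the value from the log
	// 2838834252 == 0xA9352C4C: the most significant bit (1<<31) is set,
	// which no honest frame length would have given a ~120KB max tx size.
	fmt.Printf("0x%08X, high bit set: %t\n", frameLength, frameLength&(1<<31) != 0)
}
```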
@nolanxyg could you provide some more context?
- What op-batcher and op-node configuration do you use?
- In particular, what channel configuration (target and max tx size, approx. compr. ratio)
- What's your L1 here?
- Is there any tx volume happening during the hours that your batcher is offline?
Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit https://github.com/ethereum-optimism/optimism/commit/173181bd7f45aae9225523071defd5cf202fd812.
Tracking at CLI-3428
@nolanxyg see my above message, would love to get more info so that we can reproduce
Sorry! @s7v7nislands, please help provide more context, thanks
@nolanxyg could you provide some more context?
- What op-batcher and op-node configuration do you use?
- In particular, what channel configuration (target and max tx size, approx. compr. ratio)
- What's your L1 here?
- Is there any tx volume happening during the hours that your batcher is offline?

Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit 173181b.
- BSC
- Yes, we were doing a load test
OK, we will try a more recent version to see if this issue is fixed.
@nolanxyg Thanks, some op-node and L2 config would also be appreciated. E.g. what's your L2 block time? Note BSC seems to have a block time of 3 sec.
With your batcher configuration, it will have a hard time catching up because of the low target of 624 bytes. That value is/was set in our tests to force eager batch generation. But in realistic scenarios with high L2 volume, it's too low, because each channel will only contain ~624 bytes of input data (and even less output data, because it gets compressed) and batcher txs are sent sequentially.
Yesterday I added new channel-duration-limiting functionality to the batcher in https://github.com/ethereum-optimism/optimism/pull/4990
I suggest you test your system with this new batcher and also change the config as suggested in this PR's description, but with higher max channel duration and sub safety margin. It allows you to set a low channel duration so that channels are not kept open for too long (which gives a better UX at the expense of emptier channels during low L2 tx volume). This could be set to e.g. MaxChannelDuration = 40 in a realistic setting to ensure that single-frame channels are submitted to L1 after at most 2 minutes (40 * 3 sec (L1 block time) = 2 min). But I don't know your requirements so whether you want to use channel duration limiting at all depends on your use case. My guess is you want to enable it, especially in your testnet.
Then also increase the target tx size to 100000 and use a realistic value for the approx. compression ratio (0.4). Note also that newer batchers pull the channel timeout from the rollup node, so you can delete that parameter (OP_BATCHER_CHANNEL_TIMEOUT).
tl;dr
Use batcher from #4990 and try with the following config (convert to env var names)
MaxChannelDuration: 40, // 40 * 3 sec (L1 block time) = 2 min
MaxL1TxSize: 120_000,
TargetL1TxSize: 100_000,
TargetNumFrames: 1,
ApproxComprRatio: 0.4,
SubSafetyMargin: 20,
This is similar to our e2e test config, but has a higher max channel duration and submission safety margin.
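For convenience, the same config converted to env var names (assuming the OP_BATCHER_* names that appear later in this thread):

```
OP_BATCHER_MAX_CHANNEL_DURATION: 40   # 40 * 3 sec (L1 block time) = 2 min
OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
OP_BATCHER_TARGET_NUM_FRAMES: 1
OP_BATCHER_APPROX_COMPR_RATIO: 0.4
OP_BATCHER_SUB_SAFETY_MARGIN: 20
```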
Cool, thanks a ton @sebastianst!
Here's rollup.json:
{
"genesis": {
"l1": {
"hash": "0x32a66f95f9e04e3eb3bc71b4b02b26ff1280ce8132aa0d58e9cd9f5bcfe6f2e3",
"number": 1740287
},
"l2": {
"hash": "0x74207c6f3ede9fa5e0db10aee7a3e13335978fbf07ef816b4c0bb964133ac0fa",
"number": 0
},
"l2_time": 1677489770,
"system_config": {
"batcherAddr": "0xd532d4705df2f6422b02aba6a55955f21aab2294",
"overhead": "0x0000000000000000000000000000000000000000000000000000000000000834",
"scalar": "0x00000000000000000000000000000000000000000000000000000000000f4240",
"gasLimit": 100000000
}
},
"block_time": 1,
"max_sequencer_drift": 300,
"seq_window_size": 16800,
"channel_timeout": 40,
"l1_chain_id": 714,
"l2_chain_id": 716,
"batch_inbox_address": "0xff00000000000000000000000000000000000000",
"deposit_contract_address": "0xe4562148368da2e2f94b2c590c17b550c79eb5c6",
"l1_system_config_address": "0x9609d0e7084e59f9b779f9976e9c07c039f8a026"
}
For geth-config.toml, I ran `geth dumpconfig` but got an error (the same as https://github.com/ethereum/go-ethereum/issues/26328), so I can only show the start command:
geth \
--datadir="/db" \
--verbosity=3 \
--http \
--http.corsdomain="*" \
--http.vhosts="*" \
--http.addr=0.0.0.0 \
--http.port=8545 \
--http.api=eth,engine \
--ws \
--ws.addr=0.0.0.0 \
--ws.port=8546 \
--ws.origins="*" \
--ws.api=eth,engine \
--syncmode=full \
--nodiscover \
--maxpeers=1 \
--networkid=$(CHAIN_ID) \
--unlock="$(BLOCK_SIGNER_ADDRESS)" \
--mine \
--miner.etherbase="$(BLOCK_SIGNER_ADDRESS)" \
--miner.gaslimit=25000000 \
--password=/db/password \
--allow-insecure-unlock \
--authrpc.addr="0.0.0.0" \
--authrpc.port="8551" \
--authrpc.vhosts="*" \
--authrpc.jwtsecret=/config/jwt-secret.txt \
--gcmode=archive \
--verbosity=4 \
--pprof \
--metrics \
--metrics.port 6060 \
--metrics.addr 0.0.0.0
Thanks for pointing out the 624 bytes! We also found that's too small, and we'll optimize the parameters according to your advice. Much appreciated!
We have modified block_time to 1s, and noticed that the default NewPayloadTimeout parameter value (2s) of op-geth may be inconsistent with that. Should we modify NewPayloadTimeout to 1s as well?
Quick feedback on your rollup config:
- a channel timeout of 40 (2 min in your case) seems rather low. It's the value I proposed setting as the max channel duration in the batcher, but as a consensus variable I'd set this higher, e.g. at least to 120 (6 min).
- why the very high seq window size of 16800?
We'll get back regarding the NewPayloadTimeout soon.
Looking forward to your feedback on the new batcher!
Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit https://github.com/ethereum-optimism/optimism/commit/173181bd7f45aae9225523071defd5cf202fd812.
Hi @sebastianst, we have tried again with code version https://github.com/ethereum-optimism/optimism/releases/tag/op-proposer%2Fv1.0.0-rc.1, and the `frameLength is too large` error came up again.
Here's what we did:
- Set up the whole network (configured with the info mentioned above)
- Stop the batcher for several hours (or, in our case, the batcher ran out of money, as shown in the monitor below)
[screenshot: batcher balance monitor]
- Start the batcher (or fund it; in our case we funded it at 03-01 10:43, it soon ran out of money again, so we funded it again at 03-01 11:29)
- Check the op-node log; we found the `frameLength is too large` error at about 03-01 10:43 and 03-01 11:29, which is exactly when the batcher returned to work
Quick feedback on your rollup config:
- a channel timeout of 40 (2 min in your case) seems rather low. It's the value I proposed setting as the max channel duration in the batcher, but as a consensus variable I'd set this higher, e.g. at least to 120 (6 min).
- why the very high seq window size of 16800?
We'll get back regarding the NewPayloadTimeout soon. Looking forward to your feedback on the new batcher!
- Got it. How about setting `OP_BATCHER_CHANNEL_TIMEOUT` to `40*12/3` and `channel_timeout` in rollup.json to `120*12/3`? `12` means the Ethereum block time, and `3` means the BSC block time.
- The `seq_window_size` value in the Ethereum environment is 3600; we considered that the BSC block time is 3s, so `3600*12/3 = 14400 ≈ 16800`.
We have modified block_time to 1s, and noticed that the default NewPayloadTimeout parameter value (2s) of op-geth may be inconsistent with that. Should we modify NewPayloadTimeout to 1s as well?
No, payload-building timeouts should be generous. If it's too strict, there's a chance it gets stuck on a valid canonical block that takes a little too much time to build. On L1, at 30M gas, we generally regard the max building time as around 2 seconds. A 2 second block-time is already tight, and the tuned EIP-1559 parameters avoid too many repeated gas-bursts from repeatedly hitting this 2 second build time, by targeting less gas on average and exponentially adjusting the base fee upwards.
If you fall behind a little bit due to a large block, then with the right gas-adjustment parameters the next ones will be more lightweight, take less time to build, and thus get the L2 chain back in sync with the wallclock.
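As a concrete illustration of that adjustment, here is a sketch of the EIP-1559 base-fee update rule being referenced (L1 mainnet parameters; an L2 deployment would tune the target and denominator):

```go
package main

import "fmt"

// nextBaseFee sketches the EIP-1559 base-fee update rule: the fee moves
// proportionally to how far gas usage is from the target, capped per block.
func nextBaseFee(baseFee, gasUsed, gasTarget uint64) uint64 {
	const denominator = 8 // caps the change at 12.5% per block on L1
	if gasUsed > gasTarget {
		delta := baseFee * (gasUsed - gasTarget) / gasTarget / denominator
		if delta < 1 {
			delta = 1
		}
		return baseFee + delta
	}
	return baseFee - baseFee*(gasTarget-gasUsed)/gasTarget/denominator
}

func main() {
	// A full 30M-gas block against a 15M target raises the base fee by the
	// maximum 12.5%, pricing out load so the next blocks are lighter and
	// faster to build.
	fmt.Println(nextBaseFee(1_000_000_000, 30_000_000, 15_000_000)) // 1125000000
}
```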
Got it. How about setting `OP_BATCHER_CHANNEL_TIMEOUT` to `40*12/3` and `channel_timeout` in rollup.json to `120*12/3`? `12` means the Ethereum block time, and `3` means the BSC block time. The `seq_window_size` value in the Ethereum environment is [3600](https://github.com/ethereum-optimism/optimism/blob/3177f8850ed94c7935032155f94dd2629d809f96/op-node/chaincfg/chains.go#L62); we considered that the BSC block time is 3s, so `3600*12/3 = 14400 ≈ 16800`.
- The new batcher queries the channel timeout from its rollup node, so this parameter (OP_BATCHER_CHANNEL_TIMEOUT) is obsolete.
- A channel timeout of 480 in your rollup.json sounds reasonable then. And 3600 * 12/3 = 14400, so 16800 for the SWS should be fine too.
- Did you try the batcher from #4990 (commit c8f16b5) yet? Try it with this config:
OP_BATCHER_MAX_CHANNEL_DURATION: 20 # 1min, adjust if you want longer
OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
OP_BATCHER_TARGET_NUM_FRAMES: 1
OP_BATCHER_APPROX_COMPR_RATIO: 0.6 # note, updated from 0.4
OP_BATCHER_SUB_SAFETY_MARGIN: 20
The high target and max tx size make sure that the batcher sends full frames when it's catching up after being offline for some time.
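Back-of-the-envelope on why: if the batcher's channel input threshold is roughly the target size divided by the approximate compression ratio, as the parameter names suggest (an assumption; the exact formula lives in channel_builder.go), the suggested config buffers far more input per channel than the old 624-byte target:

```go
package main

import "fmt"

func main() {
	// Hypothetical threshold calculation based on the parameter names;
	// the actual formula lives in op-batcher's channel builder.
	const (
		targetNumFrames  = 1
		targetL1TxSize   = 100_000 // OP_BATCHER_TARGET_L1_TX_SIZE_BYTES
		approxComprRatio = 0.6     // OP_BATCHER_APPROX_COMPR_RATIO
	)
	input := float64(targetNumFrames*targetL1TxSize) / approxComprRatio
	// ~167KB of L2 input data per channel, vs ~1.5KB with the old
	// 624-byte target at a 0.4 compression ratio.
	fmt.Printf("~%.0f bytes of input per channel\n", input)
}
```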
@nolanxyg I looked at your logs screenshot. It's conspicuous that the error at 10:43 mentions a frame number of 14, because with your config (OP_BATCHER_TARGET_NUM_FRAMES: 1) the batcher should actually only send txs whose data includes 1-2 frames. So it looks to me like the frame data is total garbage. But then it seems like a miracle that parsing gets to frame 14 and that the 13 "frames" before it unmarshal without error. Also, the first byte of the tx data must be correctly set to 0 = DerivationVersion0.
I'm curious whether you still get errors with the #4990 batcher. Also, could you enable batcher debug logging (OP_BATCHER_LOG_LEVEL: "debug") and upload the logs from shortly before and after the time when the node detects broken batcher tx data?
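For readers following along, a sketch of why misaligned tx data produces absurd lengths (the frame layout below is my reading of the bedrock derivation spec at the time, not quoted from the repo):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// Frame header layout per my reading of the bedrock derivation spec
// (sizes are assumptions, not quoted from the code):
//   tx_data = version_byte (0x00 = DerivationVersion0) ++ frame*
//   frame   = channel_id (16B) ++ frame_number (uint16) ++
//             frame_data_length (uint32) ++ frame_data ++ is_last (1B)
func readHeader(r io.Reader) (num uint16, length uint32, err error) {
	var chID [16]byte
	if _, err = io.ReadFull(r, chID[:]); err != nil {
		return
	}
	if err = binary.Read(r, binary.BigEndian, &num); err != nil {
		return
	}
	err = binary.Read(r, binary.BigEndian, &length)
	return
}

func main() {
	// A stream misaligned by even one byte (e.g. a duplicated version
	// byte) makes readHeader decode arbitrary data as the length field.
	data := append(make([]byte, 18), 0xA9, 0x35, 0x2C, 0x4C)
	num, length, _ := readHeader(bytes.NewReader(data))
	fmt.Println(num, length) // 0 2838834252
}
```

Under such a layout, a single stray byte shifts every subsequent field, which matches the "total garbage" theory above.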
No, payload-building timeouts should be generous. If it's too strict, there's a chance it gets stuck on a valid canonical block that takes a little too much time to build. On L1, at 30M gas, we generally regard the max building time as around 2 seconds. A 2 second block-time is already tight, and the tuned EIP-1559 parameters avoid too many repeated gas-bursts from repeatedly hitting this 2 second build time, by targeting less gas on average and exponentially adjusting the base fee upwards.
If you fall behind a little bit due to a large block, then with the right gas-adjustment parameters the next ones will be more lightweight, take less time to build, and thus get the L2 chain back in sync with the wallclock.
Nice explanation, understood, thanks @protolambda!
We set l2_block_time to 1s considering there's no consensus on L2 for now, so we tried an aggressive value. Does a 1s block time have potential risks? (If so, we'll change back to 2s.)
Did you try the batcher from https://github.com/ethereum-optimism/optimism/pull/4990 (commit c8f16b5) yet? Try it with this config
OP_BATCHER_MAX_CHANNEL_DURATION: 20 # 1min, adjust if you want longer
OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
OP_BATCHER_TARGET_NUM_FRAMES: 1
OP_BATCHER_APPROX_COMPR_RATIO: 0.6 # note, updated from 0.4
OP_BATCHER_SUB_SAFETY_MARGIN: 20
OK, we'll try this, and give the feedback soon
Also, could you enable batcher debug logging (OP_BATCHER_LOG_LEVEL: "debug") and upload the logs from shortly before and after the time when the node detects broken batcher tx data?
Sure
@nolanxyg #4990 just got merged into develop, so you can use the batcher from develop now.
I might want to reproduce this locally. Can you estimate the minimal steps to reproduce this? If I spin up a new network, but don't start the batcher yet, produce a good amount of L2 tx volume, then spin up the batcher after a few hours, would that already trigger it?
To match our steps exactly: spin up a network and verify that everything works (the batcher too, but its balance should be low), produce a good amount of L2 tx volume so the batcher soon runs out of money, wait a few hours (e.g. overnight), then fund the batcher to bring it back to work. Finally, check whether the network recovers and whether the frame error shows up in the logs.
I suspect that the interruption of submit-tx sending (due to insufficient funds) causes the error:
ERROR[02-26|09:11:42.718] unable to publish transaction service=batcher txHash=9aadf5..d71a8f nonce=55244 gasTipCap=20,000,000,000 gasFeeCap=20,000,000,000 err="insufficient funds for gas * price + value"
@sebastianst I think the PR fixes the bug, please take a look.
I might want to reproduce this locally. Can you estimate the minimal steps to reproduce this? If I spin up a new network, but don't start the batcher yet, produce a good amount of L2 tx volume, then spin up the batcher after a few hours, would that already trigger it?
You can change the code in txmgr.go to always return an error from SendTransaction(), and print the data in nextTxData; you will find the version byte repeated over and over.
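To illustrate the failure mode being described (a hypothetical sketch of the bug pattern, not the actual op-batcher code): if building the tx data mutates stored state, every retry after a failed send prepends the version byte again:

```go
package main

import "fmt"

const derivationVersion0 = 0x00 // version prefix expected by op-node

// txData is a hypothetical stand-in for the batcher's pending tx data.
type txData struct{ frames []byte }

// bytes rebuilds the tx payload. BUG: it prepends the version byte to the
// stored buffer and saves the result, so each retry adds another 0x00.
func (t *txData) bytes() []byte {
	t.frames = append([]byte{derivationVersion0}, t.frames...)
	return t.frames
}

func main() {
	t := &txData{frames: []byte{0xAA, 0xBB}}      // pretend frame bytes
	fmt.Printf("send attempt 1: %X\n", t.bytes()) // 00AABB   -- valid
	fmt.Printf("send attempt 2: %X\n", t.bytes()) // 0000AABB -- corrupted
}
```

The extra leading zero shifts the whole frame stream by one byte on each retry, which is exactly the kind of misalignment that makes op-node decode garbage frame lengths.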
Thanks! That should fix it indeed! I'll also take a closer look at whether it makes sense to clean this up with a small refactor, because this prepending and appending of version zeros is quite error prone.