[BUG] op-node parse L1 batch data error: `frameLength is too large: 2838834252`
Describe the bug
op-node encounters an unrecoverable error when parsing batch data submitted by the batcher: `frameLength is too large: 2838834252`. This leaves op-node unable to reconcile with L1.
To Reproduce
Steps to reproduce the behavior:
- Stop the batcher for several hours
- Start the batcher
- op-node prints this error log:
Feb 22, 2023 @ 11:32:26.580 WARN [02-22|03:32:26.580] Failed to parse frames origin=f4c844..525ff1:1589279 err="frameLength is too large: 2838834252"
Expected behavior
op-node should keep reconciling with L1.
System Specs:
- OS: Ubuntu
- Package Version (or commit hash): 173181bd7f45aae9225523071defd5cf202fd812
Additional context
This error looks wrong to me: the frameLength should never be that large. I think it hit a decoding error, and read the wrong data as frame-length.
In the verifier this code was recently changed (hence the different error message formatting) due to a finding from the Sherlock bug-bounty contest (https://github.com/ethereum-optimism/optimism/pull/4867), which may be related.
But the error path is still the same:
- The frame is too large: https://github.com/ethereum-optimism/optimism/blob/72f24f11d21c167108082fdbe5bdffb3293348ed/op-node/rollup/derive/frame.go#L90
- The frame decoding errors: https://github.com/ethereum-optimism/optimism/blob/72f24f11d21c167108082fdbe5bdffb3293348ed/op-node/rollup/derive/frame.go#L140
- The bad frame is ignored: https://github.com/ethereum-optimism/optimism/blob/9a198e90b1a7ab90a2ca85a3277576103eba6548/op-node/rollup/derive/frame_queue.go#L45
This verifier code-path here functions as expected: bad frames are dropped and skipped over.
The batcher however needs to ensure it only outputs valid frames, or the frames will not be accepted.
The maximum frame size is configurable: https://github.com/ethereum-optimism/optimism/blob/e03b50731e0718a311f4f59e0699c90e6a598759/op-batcher/batcher/channel_builder.go#L26
And based on the max L1 tx size by default: https://github.com/ethereum-optimism/optimism/blob/3468c6af6df8132c848f6f329a0a5592e6a14800/op-batcher/batcher/driver.go#L92
Which is smaller than the enforced frame size: https://github.com/ethereum-optimism/optimism/blob/72f24f11d21c167108082fdbe5bdffb3293348ed/op-node/rollup/derive/frame.go#L14
So assuming the settings are left to defaults, then this may indicate some bug (that may be related or duplicate of the Sherlock contest finding) in the frame-encoding when buffering large amounts of data that corrupts the output (resulting in the ridiculous 2.8+ GB frame-size number).
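For context, a minimal sketch of the failing check (a paraphrase of the linked frame.go logic, not a verbatim copy; the 1,000,000-byte `MaxFrameLen` is my reading of the constant at that commit):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// MaxFrameLen approximates the frame-size cap enforced in
// op-node/rollup/derive/frame.go.
const MaxFrameLen = 1_000_000

// readFrameLen mimics the failing step: decode a big-endian uint32 length
// and reject anything above the cap.
func readFrameLen(r *bytes.Reader) (uint32, error) {
	var frameLength uint32
	if err := binary.Read(r, binary.BigEndian, &frameLength); err != nil {
		return 0, err
	}
	if frameLength > MaxFrameLen {
		return 0, fmt.Errorf("frameLength is too large: %d", frameLength)
	}
	return frameLength, nil
}

func main() {
	// The four bytes that decode to the value from the bug report.
	_, err := readFrameLen(bytes.NewReader([]byte{0xA9, 0x35, 0x2C, 0x4C}))
	fmt.Println(err) // frameLength is too large: 2838834252
}
```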
cc @sebastianst flagging possible op-batcher bug.
Thanks @protolambda for the initial analysis!
The frameLength of 2838834252 looks suspicious to me because it has the highest uint32 bit set: I think it never was a frameLength to begin with but instead is just random data that got decoded into a frame length. The unmarshaling code doesn't do any sanity check before this first check on the max. frame length. I bet the decoded frame number and channel id before are also random.
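To make the high-bit observation concrete (plain arithmetic, independent of the repo):

```go
package main

import "fmt"

func main() {
	const frameLength uint32 = 2838834252 // the value from the log
	// 2838834252 == 0xA9352C4C: the most significant bit (1<<31) is set,
	// which no honest frame length would have given a ~120KB max tx size.
	fmt.Printf("0x%08X, high bit set: %t\n", frameLength, frameLength&(1<<31) != 0)
}
```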
@nolanxyg could you provide some more context?
- What op-batcher and op-node configuration do you use?
- In particular, what channel configuration (target and max tx size, approx. compr. ratio)
- What's your L1 here?
- Is there any tx volume happening during the hours that your batcher is offline?
Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit https://github.com/ethereum-optimism/optimism/commit/173181bd7f45aae9225523071defd5cf202fd812.
Tracking at CLI-3428
@nolanxyg see my above message, would love to get more info so that we can reproduce
Sorry! @s7v7nislands, please help provide more context, thanks
@nolanxyg could you provide some more context?
- What op-batcher and op-node configuration do you use?
- In particular, what channel configuration (target and max tx size, approx. compr. ratio)
- What's your L1 here?
- Is there any tx volume happening during the hours that your batcher is offline?

Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit 173181b.
- BSC
- Yes, we were doing a load test
OK, we will try a more recent version to see if this issue is fixed.
@nolanxyg Thanks, some op-node and L2 config would also be appreciated. E.g. what's your L2 block time? Note BSC seems to have a block time of 3 sec.
With your batcher configuration, it will have a hard time catching up because of the low target of 624 bytes. That value is/was set in our tests to force eager batch generation. But in realistic scenarios with high L2 volume, it's too low, because each channel will only contain ~624 bytes of input data (and even less output data, because it gets compressed) and batcher txs are sent sequentially.
Yesterday I added new channel-duration-limiting functionality to the batcher in https://github.com/ethereum-optimism/optimism/pull/4990
I suggest you test your system with this new batcher and also change the config as suggested in this PR's description, but with higher max channel duration and sub safety margin. It allows you to set a low channel duration so that channels are not kept open for too long (which gives a better UX at the expense of emptier channels during low L2 tx volume). This could be set to e.g. MaxChannelDuration = 40 in a realistic setting to ensure that single-frame channels are submitted to L1 after at most 2 minutes (40 * 3 sec (L1 block time) = 2 min). But I don't know your requirements so whether you want to use channel duration limiting at all depends on your use case. My guess is you want to enable it, especially in your testnet.
Then also increase the target tx size to 100000 and use a realistic value for the approx. compression ratio (0.4). Note also that newer batchers pull the channel timeout from the rollup node, so you can delete that parameter (OP_BATCHER_CHANNEL_TIMEOUT).
tl;dr
Use batcher from #4990 and try with the following config (convert to env var names)
MaxChannelDuration: 40, // 40 * 3 sec (L1 block time) = 2 min
MaxL1TxSize: 120_000,
TargetL1TxSize: 100_000,
TargetNumFrames: 1,
ApproxComprRatio: 0.4,
SubSafetyMargin: 20,
This is similar to our e2e test config, but has a higher max channel duration and submission safety margin.
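For convenience, the same config converted to env var names (assuming the OP_BATCHER_* names that appear later in this thread):

```
OP_BATCHER_MAX_CHANNEL_DURATION: 40   # 40 * 3 sec (L1 block time) = 2 min
OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
OP_BATCHER_TARGET_NUM_FRAMES: 1
OP_BATCHER_APPROX_COMPR_RATIO: 0.4
OP_BATCHER_SUB_SAFETY_MARGIN: 20
```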
Cool, thanks a ton @sebastianst!
Here's rollup.json:
{
"genesis": {
"l1": {
"hash": "0x32a66f95f9e04e3eb3bc71b4b02b26ff1280ce8132aa0d58e9cd9f5bcfe6f2e3",
"number": 1740287
},
"l2": {
"hash": "0x74207c6f3ede9fa5e0db10aee7a3e13335978fbf07ef816b4c0bb964133ac0fa",
"number": 0
},
"l2_time": 1677489770,
"system_config": {
"batcherAddr": "0xd532d4705df2f6422b02aba6a55955f21aab2294",
"overhead": "0x0000000000000000000000000000000000000000000000000000000000000834",
"scalar": "0x00000000000000000000000000000000000000000000000000000000000f4240",
"gasLimit": 100000000
}
},
"block_time": 1,
"max_sequencer_drift": 300,
"seq_window_size": 16800,
"channel_timeout": 40,
"l1_chain_id": 714,
"l2_chain_id": 716,
"batch_inbox_address": "0xff00000000000000000000000000000000000000",
"deposit_contract_address": "0xe4562148368da2e2f94b2c590c17b550c79eb5c6",
"l1_system_config_address": "0x9609d0e7084e59f9b779f9976e9c07c039f8a026"
}
For geth-config.toml, I ran `geth dumpconfig` but got an error (the same as https://github.com/ethereum/go-ethereum/issues/26328), so I can only show the start command:
geth \
--datadir="/db" \
--verbosity=3 \
--http \
--http.corsdomain="*" \
--http.vhosts="*" \
--http.addr=0.0.0.0 \
--http.port=8545 \
--http.api=eth,engine \
--ws \
--ws.addr=0.0.0.0 \
--ws.port=8546 \
--ws.origins="*" \
--ws.api=eth,engine \
--syncmode=full \
--nodiscover \
--maxpeers=1 \
--networkid=$(CHAIN_ID) \
--unlock="$(BLOCK_SIGNER_ADDRESS)" \
--mine \
--miner.etherbase="$(BLOCK_SIGNER_ADDRESS)" \
--miner.gaslimit=25000000 \
--password=/db/password \
--allow-insecure-unlock \
--authrpc.addr="0.0.0.0" \
--authrpc.port="8551" \
--authrpc.vhosts="*" \
--authrpc.jwtsecret=/config/jwt-secret.txt \
--gcmode=archive \
--verbosity=4 \
--pprof \
--metrics \
--metrics.port 6060 \
--metrics.addr 0.0.0.0
Thanks for pointing out the 624 bytes! We also found that's too small, and we'll optimize the parameters according to your advice. Much appreciated!
We have modified block_time to 1s, and noticed that the default NewPayloadTimeout parameter value (2s) of op-geth may be inconsistent with that. Should we modify NewPayloadTimeout to 1s as well?
Quick feedback on your rollup config:
- a channel timeout of 40 (2 min in your case) seems rather low. It's the value I proposed setting as the max channel duration in the batcher, but as a consensus variable I'd set this higher, e.g. at least to 120 (6 min).
- why the very high seq window size of 16800?
We'll get back regarding the NewPayloadTimeout soon.
Looking forward to your feedback on the new batcher!
Please also give this a try with a more recent version of the monorepo. A few changes have been made to the batcher and node since your specified commit https://github.com/ethereum-optimism/optimism/commit/173181bd7f45aae9225523071defd5cf202fd812.
Hi @sebastianst, we have tried again with code version https://github.com/ethereum-optimism/optimism/releases/tag/op-proposer%2Fv1.0.0-rc.1, and the `frameLength is too large` error came up again.
Here's what we did:
- Set up the whole network (configured with the info mentioned above)
- Stop the batcher for several hours (or, in our case, the batcher ran out of money, as shown in the monitor below)
[screenshot: batcher balance monitor]
- Start the batcher (or fund it; in our case we funded it at 03-01 10:43, it soon ran out of money again, so we funded it again at 03-01 11:29)
- Check the op-node log; we found the `frameLength is too large` error at about 03-01 10:43 and 03-01 11:29, which is exactly when the batcher returned to work
Quick feedback on your rollup config:
- a channel timeout of 40 (2 min in your case) seems rather low. It's the value I proposed setting as the max channel duration in the batcher, but as a consensus variable I'd set this higher, e.g. at least to 120 (6 min).
- why the very high seq window size of 16800?
We'll get back regarding the NewPayloadTimeout soon. Looking forward to your feedback on the new batcher!
- Got it. How about setting `OP_BATCHER_CHANNEL_TIMEOUT` to `40*12/3` and `channel_timeout` in rollup.json to `120*12/3`? `12` means the Ethereum block time, and `3` means the BSC block time.
- The `seq_window_size` value in the Ethereum environment is 3600; we considered that the BSC block time is 3s, so `3600*12/3 = 14400 ≈ 16800`.
We have modified block_time to 1s, and noticed that the default NewPayloadTimeout parameter value (2s) of op-geth may be inconsistent with that. Should we modify NewPayloadTimeout to 1s as well?
No, payload-building timeouts should be generous. If it's too strict, there's a chance it gets stuck on a valid canonical block that takes a little too much time to build. On L1, at 30M gas, we generally regard the max building time as around 2 seconds. A 2 second block-time is already tight, and the tuned EIP-1559 parameters avoid too many repeated gas-bursts from repeatedly hitting this 2 second build time, by targeting less gas on average and exponentially adjusting the base fee upwards.
If you fall behind a little bit due to a large block, then with the right gas-adjustment parameters the next ones will be more lightweight, take less time to build, and thus get the L2 chain back in sync with the wallclock.
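As a concrete illustration of that adjustment, here is a sketch of the EIP-1559 base-fee update rule being referenced (L1 mainnet parameters; an L2 deployment would tune the target and denominator):

```go
package main

import "fmt"

// nextBaseFee sketches the EIP-1559 base-fee update rule: the fee moves
// proportionally to how far gas usage is from the target, capped per block.
func nextBaseFee(baseFee, gasUsed, gasTarget uint64) uint64 {
	const denominator = 8 // caps the change at 12.5% per block on L1
	if gasUsed > gasTarget {
		delta := baseFee * (gasUsed - gasTarget) / gasTarget / denominator
		if delta < 1 {
			delta = 1
		}
		return baseFee + delta
	}
	return baseFee - baseFee*(gasTarget-gasUsed)/gasTarget/denominator
}

func main() {
	// A full 30M-gas block against a 15M target raises the base fee by the
	// maximum 12.5%, pricing out load so the next blocks are lighter and
	// faster to build.
	fmt.Println(nextBaseFee(1_000_000_000, 30_000_000, 15_000_000)) // 1125000000
}
```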
Got it. How about setting `OP_BATCHER_CHANNEL_TIMEOUT` to `40*12/3` and `channel_timeout` in rollup.json to `120*12/3`? `12` means the Ethereum block time, and `3` means the BSC block time. The `seq_window_size` value in the Ethereum environment is [3600](https://github.com/ethereum-optimism/optimism/blob/3177f8850ed94c7935032155f94dd2629d809f96/op-node/chaincfg/chains.go#L62); we considered that the BSC block time is 3s, so `3600*12/3 = 14400 ≈ 16800`.
- The new batcher queries the channel timeout from its rollup node, so this parameter (OP_BATCHER_CHANNEL_TIMEOUT) is obsolete.
- A channel timeout of 480 in your rollup.json sounds reasonable then. And 3600 * 12/3 = 14400, so 16800 for the SWS should be fine too.
- Did you try the batcher from #4990 (commit c8f16b5) yet? Try it with this config:
OP_BATCHER_MAX_CHANNEL_DURATION: 20 # 1min, adjust if you want longer
OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
OP_BATCHER_TARGET_NUM_FRAMES: 1
OP_BATCHER_APPROX_COMPR_RATIO: 0.6 # note, updated from 0.4
OP_BATCHER_SUB_SAFETY_MARGIN: 20
The high target and max tx size make sure that the batcher sends full frames when it's catching up after being offline for some time.
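Back-of-the-envelope on why: if the batcher's channel input threshold is roughly the target size divided by the approximate compression ratio, as the parameter names suggest (an assumption; the exact formula lives in channel_builder.go), the suggested config buffers far more input per channel than the old 624-byte target:

```go
package main

import "fmt"

func main() {
	// Hypothetical threshold calculation based on the parameter names;
	// the actual formula lives in op-batcher's channel builder.
	const (
		targetNumFrames  = 1
		targetL1TxSize   = 100_000 // OP_BATCHER_TARGET_L1_TX_SIZE_BYTES
		approxComprRatio = 0.6     // OP_BATCHER_APPROX_COMPR_RATIO
	)
	input := float64(targetNumFrames*targetL1TxSize) / approxComprRatio
	// ~167KB of L2 input data per channel, vs ~1.5KB with the old
	// 624-byte target at a 0.4 compression ratio.
	fmt.Printf("~%.0f bytes of input per channel\n", input)
}
```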
@nolanxyg I looked at your logs screenshot. It's conspicuous that the error at 10:43 mentions a frame number of 14, because with your config (OP_BATCHER_TARGET_NUM_FRAMES: 1) the batcher should actually only send txs whose data includes 1-2 frames. So it looks to me like the frame data is total garbage. But then it seems like a miracle that parsing gets to frame 14 and that the 13 "frames" before it unmarshal without error. Also, the first byte of the tx data must be correctly set to 0 = DerivationVersion0.
I'm curious whether you still get errors with the #4990 batcher. Also, could you enable batcher debug logging (OP_BATCHER_LOG_LEVEL: "debug") and upload the logs from shortly before and after the time when the node detects broken batcher tx data?
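For readers following along, a sketch of why misaligned tx data produces absurd lengths (the frame layout below is my reading of the bedrock derivation spec at the time, not quoted from the repo):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// Frame header layout per my reading of the bedrock derivation spec
// (sizes are assumptions, not quoted from the code):
//   tx_data = version_byte (0x00 = DerivationVersion0) ++ frame*
//   frame   = channel_id (16B) ++ frame_number (uint16) ++
//             frame_data_length (uint32) ++ frame_data ++ is_last (1B)
func readHeader(r io.Reader) (num uint16, length uint32, err error) {
	var chID [16]byte
	if _, err = io.ReadFull(r, chID[:]); err != nil {
		return
	}
	if err = binary.Read(r, binary.BigEndian, &num); err != nil {
		return
	}
	err = binary.Read(r, binary.BigEndian, &length)
	return
}

func main() {
	// A stream misaligned by even one byte (e.g. a duplicated version
	// byte) makes readHeader decode arbitrary data as the length field.
	data := append(make([]byte, 18), 0xA9, 0x35, 0x2C, 0x4C)
	num, length, _ := readHeader(bytes.NewReader(data))
	fmt.Println(num, length) // 0 2838834252
}
```

Under such a layout, a single stray byte shifts every subsequent field, which matches the "total garbage" theory above.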
No, payload-building timeouts should be generous. If it's too strict, there's a chance it gets stuck on a valid canonical block that takes a little too much time to build. On L1, at 30M gas, we generally regard the max building time as around 2 seconds. A 2 second block-time is already tight, and the tuned EIP-1559 parameters avoid too many repeated gas-bursts from repeatedly hitting this 2 second build time, by targeting less gas on average and exponentially adjusting the base fee upwards.
If you fall behind a little bit due to a large block, then with the right gas-adjustment parameters the next ones will be more lightweight, take less time to build, and thus get the L2 chain back in sync with the wallclock.
Nice explanation, understood, thanks @protolambda!
We set l2_block_time to 1s considering there's no consensus on L2 for now, so we tried an aggressive value. Does a 1s block time have potential risks? (If so, we'll change back to 2s.)
Did you try the batcher from https://github.com/ethereum-optimism/optimism/pull/4990 (commit c8f16b5) yet? Try it with this config
OP_BATCHER_MAX_CHANNEL_DURATION: 20 # 1min, adjust if you want longer
OP_BATCHER_MAX_L1_TX_SIZE_BYTES: 120000
OP_BATCHER_TARGET_L1_TX_SIZE_BYTES: 100000
OP_BATCHER_TARGET_NUM_FRAMES: 1
OP_BATCHER_APPROX_COMPR_RATIO: 0.6 # note, updated from 0.4
OP_BATCHER_SUB_SAFETY_MARGIN: 20
OK, we'll try this, and give the feedback soon
Also, could you enable batcher debug logging (OP_BATCHER_LOG_LEVEL: "debug") and upload the logs from shortly before and after the time when the node detects broken batcher tx data?
Sure
@nolanxyg #4990 just got merged into develop, so you can use the batcher from develop now.
I might want to reproduce this locally. Can you estimate the minimal steps to reproduce this? If I spin up a new network, but don't start the batcher yet, produce a good amount of L2 tx volume, then spin up the batcher after a few hours, would that already trigger it?
To match our steps exactly: spin up a network and verify that everything works (the batcher too, but its balance should be low), produce a good amount of L2 tx volume so the batcher soon runs out of money, wait a few hours (e.g. overnight), then fund the batcher to bring it back to work. Finally, check whether the network recovers and whether the frame error shows up in the logs.
I suspect that the interruption of submit-tx sending (due to insufficient funds) causes the error:
ERROR[02-26|09:11:42.718] unable to publish transaction service=batcher txHash=9aadf5..d71a8f nonce=55244 gasTipCap=20,000,000,000 gasFeeCap=20,000,000,000 err="insufficient funds for gas * price + value"
@sebastianst I think the PR fixes the bug, please take a look.
I might want to reproduce this locally. Can you estimate the minimal steps to reproduce this? If I spin up a new network, but don't start the batcher yet, produce a good amount of L2 tx volume, then spin up the batcher after a few hours, would that already trigger it?
You can change the code in txmgr.go to always return an error from SendTransaction(), and print the data in nextTxData; you will find the version byte repeated over and over.
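To illustrate the failure mode being described (a hypothetical sketch of the bug pattern, not the actual op-batcher code): if building the tx data mutates stored state, every retry after a failed send prepends the version byte again:

```go
package main

import "fmt"

const derivationVersion0 = 0x00 // version prefix expected by op-node

// txData is a hypothetical stand-in for the batcher's pending tx data.
type txData struct{ frames []byte }

// bytes rebuilds the tx payload. BUG: it prepends the version byte to the
// stored buffer and saves the result, so each retry adds another 0x00.
func (t *txData) bytes() []byte {
	t.frames = append([]byte{derivationVersion0}, t.frames...)
	return t.frames
}

func main() {
	t := &txData{frames: []byte{0xAA, 0xBB}}      // pretend frame bytes
	fmt.Printf("send attempt 1: %X\n", t.bytes()) // 00AABB   -- valid
	fmt.Printf("send attempt 2: %X\n", t.bytes()) // 0000AABB -- corrupted
}
```

The extra leading zero shifts the whole frame stream by one byte on each retry, which is exactly the kind of misalignment that makes op-node decode garbage frame lengths.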
Thanks! That should fix it indeed! I'll also take a closer look at whether it makes sense to clean this up with a small refactor, because this prepending and appending of version zeros is quite error prone.