possible bug in op-node sequencer recover mode
On a partner's testnet, recover mode was enabled on the sequencer during normal operation (i.e. not after a sequencing window expiry / autoderivation situation). The following logs were then seen on verifier nodes:
t=2025-11-18T22:17:00+0000 lvl=info msg="decoded singular batch from channel" batch_type=SingularBatch batch_timestamp=1763488130 parent_hash=0x303a8d56bc568ee033ef69ba1135d8c1f30e472f97b9f85f1ae136956efb0cfe batch_epoch=0x1d01afb5316a7640d6b6a0fd36dc42c0a2e01b7b95bb6f701e2789be0aeff341:9656474 txs=0 compression_algo=brotli stage_origin=0x3c4778595a56b74d95e54179ea21c8c9b052f2c523d851f49f116b7f1b2a0734:9657864
t=2025-11-18T22:17:00+0000 lvl=info msg="batch exceeded sequencer time drift without adopting next origin, and next L1 origin would have been valid" origin=0x3c4778595a56b74d95e54179ea21c8c9b052f2c523d851f49f116b7f1b2a0734:9657864 epoch=0x1d01afb5316a7640d6b6a0fd36dc42c0a2e01b7b95bb6f701e2789be0aeff341:9656474 batch_type=SingularBatch batch_timestamp=1763488130 parent_hash=0x303a8d56bc568ee033ef69ba1135d8c1f30e472f97b9f85f1ae136956efb0cfe batch_epoch=0x1d01afb5316a7640d6b6a0fd36dc42c0a2e01b7b95bb6f701e2789be0aeff341:9656474 txs=0
t=2025-11-18T22:17:00+0000 lvl=warn msg="Dropping invalid singular batch, flushing channel" origin=0x3c4778595a56b74d95e54179ea21c8c9b052f2c523d851f49f116b7f1b2a0734:9657864 epoch=0x1d01afb5316a7640d6b6a0fd36dc42c0a2e01b7b95bb6f701e2789be0aeff341:9656474 batch_type=SingularBatch batch_timestamp=1763488130 parent_hash=0x303a8d56bc568ee033ef69ba1135d8c1f30e472f97b9f85f1ae136956efb0cfe batch_epoch=0x1d01afb5316a7640d6b6a0fd36dc42c0a2e01b7b95bb6f701e2789be0aeff341:9656474 txs=0
t=2025-11-18T22:17:00+0000 lvl=info msg="decoded singular batch from channel" batch_type=SingularBatch batch_timestamp=1763488130 parent_hash=0x303a8d56bc568ee033ef69ba1135d8c1f30e472f97b9f85f1ae136956efb0cfe batch_epoch=0x1d01afb5316a7640d6b6a0fd36dc42c0a2e01b7b95bb6f701e2789be0aeff341:9656474 txs=0 compression_algo=brotli stage_origin=0x3c4778595a56b74d95e54179ea21c8c9b052f2c523d851f49f116b7f1b2a0734:9657864
This suggests that the sequencer violated the sequencer drift rule: https://specs.optimism.io/protocol/fjord/derivation.html?highlight=sequencer%20drift#constant-maximum-sequencer-drift
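For reference, the rule behind the "exceeded sequencer time drift" warning is roughly the following check on singular batches. This is a minimal sketch of the Fjord-era rule from the spec linked above; the constant value, helper name and parameters are illustrative, not the actual op-node code:

    // Sketch of the constant-maximum-sequencer-drift rule (illustrative only).
    const maxSequencerDrift = 1800 // seconds; a protocol constant since Fjord

    func checkSequencerDrift(batchTime, epochTime uint64, txCount int, nextOriginExists bool, nextOriginTime uint64) string {
        if batchTime <= epochTime+maxSequencerDrift {
            return "ok" // within the drift limit, this rule imposes nothing
        }
        // Past the drift limit: only an empty batch is acceptable, and only if the
        // sequencer could not have adopted the next L1 origin instead.
        if txCount > 0 {
            return "drop: batch with txs exceeded sequencer drift"
        }
        if nextOriginExists && batchTime >= nextOriginTime {
            return "drop: exceeded drift without adopting next origin, and next origin would have been valid"
        }
        return "ok"
    }

The second "drop" case is the one reported by the verifier logs above: the batch is empty, but the next L1 origin existed and could have been adopted.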
Other context:
The sequencer also logged
t=2025-11-18T17:19:46+0000 lvl=error msg="Error finding next L1 Origin" err="temp: failed to fetch next L1 origin: not found"
as a recurring pattern, including while recover mode was active.
Conversation with @sebastianst today.
It seems that when the chain is "healthy" (sequencing window not expired) but the sequencer is in recover mode, the sequencer requests an L1 origin block that does not yet exist, hence the "not found" error above. In that condition the L1 origin is never advanced, which ends up violating the sequencer drift rule.
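Concretely, the recover-mode branch of the origin selector appears to turn a missing next L1 block into a temporary error, roughly like this (a paraphrased fragment to illustrate the failure mode, consistent with the log above, not the exact current code):

    // Rough shape of the current recover-mode behavior (assumption, paraphrased):
    nextOrigin, err := los.l1.L1BlockRefByNumber(ctx, currentOrigin.Number+1)
    if err != nil {
        // ethereum.NotFound is not special-cased, so being at the L1 tip surfaces as
        // "temp: failed to fetch next L1 origin: not found" and the L1 origin never
        // advances while recover mode stays on.
        return eth.L1BlockRef{}, eth.L1BlockRef{},
            derive.NewTemporaryError(fmt.Errorf("failed to fetch next L1 origin: %w", err))
    }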
When that happens and a bad batch lands on L1, derivation will stall. There are a couple of options for recovery: 1) rewind the sequencer and turn off recover mode, or 2) turn off recover mode and allow the bad batch to be reorged out.
Possible fix snippet to make recover mode not produce temporary errors once it catches up to the L1 tip:
// Snippet from the L1OriginSelector; assumes the file's existing imports
// (context, errors, fmt, github.com/ethereum/go-ethereum, and the op-node
// eth/derive packages).
func (los *L1OriginSelector) CurrentAndNextOrigin(ctx context.Context, l2Head eth.L2BlockRef) (eth.L1BlockRef, eth.L1BlockRef, error) {
    los.mu.Lock()
    defer los.mu.Unlock()

    if los.recoverMode.Load() {
        currentOrigin, err := los.l1.L1BlockRefByHash(ctx, l2Head.L1Origin.Hash)
        if err != nil {
            return eth.L1BlockRef{}, eth.L1BlockRef{},
                derive.NewTemporaryError(fmt.Errorf("failed to fetch current L1 origin: %w", err))
        }
        los.currentOrigin = currentOrigin

        nextOrigin, err := los.l1.L1BlockRefByNumber(ctx, currentOrigin.Number+1)
        if errors.Is(err, ethereum.NotFound) {
            // At the L1 tip there is no next block yet: return an empty next origin
            // instead of a temporary error, so the sequencer keeps building on the
            // current origin until the next L1 block shows up.
            los.nextOrigin = eth.L1BlockRef{}
            return los.currentOrigin, los.nextOrigin, nil
        } else if err != nil {
            return eth.L1BlockRef{}, eth.L1BlockRef{},
                derive.NewTemporaryError(fmt.Errorf("failed to fetch next L1 origin: %w", err))
        }
        // Assumed continuation (the snippet above was cut off here): store and
        // return the fetched next origin.
        los.nextOrigin = nextOrigin
        return los.currentOrigin, los.nextOrigin, nil
    }
    // ... non-recover-mode path unchanged ...
}
I still haven't figured out why the temporary error returned by the current implementation doesn't eventually go away once that next L1 block becomes available.
It is true that an explicit check on the sequencer drift, which is usually applied during normal operation, is skipped when recover mode runs into this temporary error. But even with Seb's suggestion above I believe we would still hit the temporary error for a while, and should then eventually get past it and progress the L1 origin again.
There could be an influence from the async L1 origin fetching added in this PR: https://github.com/ethereum-optimism/optimism/pull/12134. Will continue to investigate.
Possibly relevant PR https://github.com/ethereum-optimism/optimism/pull/18233