rollmint
rollmint copied to clipboard
Rollup stops posting to DA layer after a while
Version of Git SHA
Observed the error on https://github.com/rollkit/rollkit/pull/1424, however it probably occurs on main as well.
System OS
Mac OS
Steps to reproduce it
Start a gm rollup, non-lazy aggregation, let it run for several hours. Eventually, the successfully submitted Rollkit block to DA layer
messages stop appearing in the logs, but blocks still seem to get gossiped, stored, and indexed.
Expected result
successfully submitted Rollkit block to DA layer
log messages keep up with the chain
Actual result
rollup seems to stop posting blocks to DA.
Notes
If it's failing silently, we need to ensure that the sequencer stops producing blocks, or else it becomes at risk of progressing too far for DA to keep up.
address the underlying cause if possible, and don't fail silently
it seems to start after this throws. perhaps it's not properly recovering from incorrect account sequence
?
ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence"
how long is a while? I noticed that between batches posting on Polaris, the time went from every block or so, to every minute, to every few minutes, eventually lagging 15 minutes between batch postings. during that time rollkit was still producing blocks, some 1000 extra blocks, with no warnings or errors then, after posting the last batch, the chain crashed after running for around one hour with the same error that @mycodecrafting is seeing:
5:49PM ERR failed to prepare proposal err="invalid timestamp, parent 1706914174 given 1706914174" height=3573 module=server
i originally suggested to @mycodecrafting to change the block time to 2s, but i don't think changing the block time will do anything but slow down the rate at which the chain fails. it'll inevitably hit the point where it begins to lag so much, based on the number of blobs being posted IMO, that it can't fit them into a pfb and it gets stuck in mempool causing issues with acc sequence errors, eventually resulting in failed txs. (namespace on celenium to see the different batch postings click the "messages" tab)
root cause
5:49PM ERR failed to build payload err="invalid timestamp, parent [1706914174](tel:1706914174) given [1706914174](tel:1706914174)" module=server
5:49PM ERR failed to prepare proposal err="invalid timestamp, parent [1706914174](tel:1706914174) given [1706914174](tel:1706914174)" height=3573 module=server
5:49PM ERR failed to process proposal err="failed to find envelope in proposal" hash=2A4784028005D12EBD413A5BCDAC7B470DFF4B801D74D202BD0813B54E9C953A height=3573 module=server
panic: error while processing the proposal: <nil>
goroutine 356 [running]:
[github.com/rollkit/rollkit/block.(*Manager).publishBlock(0x140006a92c0](http://github.com/rollkit/rollkit/block.(*Manager).publishBlock(0x140006a92c0), {0x1049062e0, 0x14001454730?})
[github.com/rollkit/[email protected]/block/manager.go:732](http://github.com/rollkit/[email protected]/block/manager.go:732) +0xbd8
[github.com/rollkit/rollkit/block.(*Manager).AggregationLoop(0x140006a92c0](http://github.com/rollkit/rollkit/block.(*Manager).AggregationLoop(0x140006a92c0), {0x1049062e0, 0x14001454730}, 0x0)
[github.com/rollkit/[email protected]/block/manager.go:274](http://github.com/rollkit/[email protected]/block/manager.go:274) +0x1d0
[github.com/rollkit/rollkit/node.(*FullNode).OnStart.func1()](http://github.com/rollkit/rollkit/node.(*FullNode).OnStart.func1())
[github.com/rollkit/[email protected]/node/full.go:379](http://github.com/rollkit/[email protected]/node/full.go:379) +0x30
[github.com/rollkit/rollkit/types.(*ThreadManager).Go.func1()](http://github.com/rollkit/rollkit/types.(*ThreadManager).Go.func1())
[github.com/rollkit/[email protected]/types/threadmanager.go:26](http://github.com/rollkit/[email protected]/types/threadmanager.go:26) +0x58
created by [github.com/rollkit/rollkit/types.(*ThreadManager).Go](http://github.com/rollkit/rollkit/types.(*ThreadManager).Go) in goroutine 1
[github.com/rollkit/[email protected]/types/threadmanager.go:24](http://github.com/rollkit/[email protected]/types/threadmanager.go:24) +0x7c
make: * [start] Error 2
full logs & version info
versions i am using:
- rollkit-main of polaris
- celesta-da v0.12.9
- mocha network
- namespace on celenium
- block time: 1s
- os: macos
polaris logs: https://app.warp.dev/block/rgCrw9OZY4TPapRK2gkyth light node logs: https://app.warp.dev/block/tGCV5MLNTxp1tj2s5f1Bfa
testing with gm world rollup
i decided to leave the same test running overnight with gm world, using default versions in docs, v0.11.19 rollkit, v28.1.0 ignite
results
- chain posted first ~9000 blocks to DA
- significant lag between postings began in first few hours
- after this, "blob too big" errors began to show due to no logic in place to deal with backlog of blocks
- chain is still running, celestia-da has crashed, and chain has produced ~50000 more blocks than have not been posted to DA, with few error logs
- light node seems to have just gave up based on these full logs from the attempts to post blobs that were too
- full rollup logs (i manually stopped chain) show both "unable to get DA max blob size" and "failed to submit block to DA layer after 30 attempts", without any other errors saying "light node/DA isn't running anymore"
- namespace on celenium
Just attempted to manually test using this commit of celestia-da. I got this error output from Rollkit
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = rpc error: code = Unknown desc = timed out waiting for tx to be included in a block" attempt=1 module=BlockManager
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=2 module=BlockManager
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=3 module=BlockManager
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=4 module=BlockManager
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=5 module=BlockManager
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=6 module=BlockManager
The chain progresses despite the DA submission failure. We need to stop block production if pendingBlocks start to pile up past a certain number.
We also need to solve all the underlying errors with DA submission
including another example of poor dev ex relating to this problem:
▶ bash restart-local.sh
1:43PM INF starting node with ABCI CometBFT in-process module=server
1:43PM INF starting node with Rollkit in-process module=server
1:43PM INF service start impl=multiAppConn module=proxy msg="Starting multiAppConn service"
1:43PM INF service start connection=query impl=localClient module=abci-client msg="Starting localClient service"
1:43PM INF service start connection=snapshot impl=localClient module=abci-client msg="Starting localClient service"
1:43PM INF service start connection=mempool impl=localClient module=abci-client msg="Starting localClient service"
1:43PM INF service start connection=consensus impl=localClient module=abci-client msg="Starting localClient service"
1:43PM INF service start impl=EventBus module=events msg="Starting EventBus service"
1:43PM INF service start impl=PubSub module=pubsub msg="Starting PubSub service"
1:43PM INF service start impl=IndexerService module=txindex msg="Starting IndexerService service"
1:43PM INF service start impl=RPC module=server msg="Starting RPC service"
1:43PM INF service start impl=Node module=server msg="Starting Node service"
1:43PM INF starting P2P client module=server
1:43PM INF serving HTTP listen address=127.0.0.1:36657 module=server
1:43PM INF listening on address=/ip4/127.0.0.1/tcp/36656/p2p/12D3KooWELDpQFaaJqZBMNrL93KUHxVLsByEp5DJWuLhupJZ6x7F module=p2p
1:43PM INF listening on address=/ip4/192.168.1.187/tcp/36656/p2p/12D3KooWELDpQFaaJqZBMNrL93KUHxVLsByEp5DJWuLhupJZ6x7F module=p2p
1:43PM INF no seed nodes - only listening for connections module=p2p
1:43PM INF working in aggregator mode block time=1000 module=server
1:43PM INF Creating and publishing block height=64 module=BlockManager
1:43PM INF starting gRPC server... address=localhost:9090 module=grpc-server
1:43PM INF finalized block block_app_hash=631340C19AF7DFD60DF5763E3E2259F0B0360258D039EF70C18E73BC175FEFDD height=64 module=BlockManager num_txs_res=0 num_val_updates=0
1:43PM INF executed block app_hash=631340C19AF7DFD60DF5763E3E2259F0B0360258D039EF70C18E73BC175FEFDD height=64 module=BlockManager
1:43PM INF indexed block events height=64 module=txindex
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=1 module=BlockManager
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=2 module=BlockManager
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=3 module=BlockManager
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=4 module=BlockManager
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=5 module=BlockManager
1:44PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=6 module=BlockManager
1:44PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=7 module=BlockManager
1:44PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=8 module=BlockManager
1:44PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=9 module=BlockManager
1:44PM INF successfully submitted Rollkit block to DA layer count=1 daHeight=1120183 module=BlockManager
on the initial
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=1 module=BlockManager
did you already have a chain running previously and the transaction is stuck?
it looks like either the old transaction finally failed or it was overridden by a new one with higher fee?
FWIW, when I was testing out a new version of Polaris with v0.12.10 of celestia-da, I did not encounter the same issues with blocks piling up. Here is the issue linked with testing setup and logs for reference.
I am testing a GM rollup now with v0.12.10 of celestia-da. will report back here
edit: i jinxed this.
I'm hitting the same bugs on this.
setup
- celestia-da v0.12.10 with mocha light node
- rollkit v0.11.19
- gm world rollup from tutorial, posting to mocha instead of arabica
- running in this namespace
results
7 batch postings and 564 rollup blocks in, out of 2000 total blocks have been posted. ultimately, the chain stopped posting at block 564 on the rollup and has been stuck in an incorrect account sequence error since block 873. the rollup node, since block start of chain has been giving errors:
first error - timed out waiting to be included in a block
2:25PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = rpc error: code = Unknown desc = timed out waiting for tx to be included in a block" attempt=1 module=BlockManager
second error - tx already in mempool
2:25PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=2 module=BlockManager
third error - 30 attempts failed
2:32PM ERR error while submitting block to DA error="failed to submit block to DA layer after 30 attempts" module=BlockManager
fourth error - incorrect account sequence
2:32PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=1 module=BlockManager
ultimately, the chain stopped posting at block 564 on the rollup and has been stuck in an incorrect account sequence error since block 873
logs
welp
update...because i go back to check celenium after i have stopped the chain and it has posted the last 1400 blocks in 2 blobs
the inconsistency though is worrisome. i am starting up another test of this and will let it run longer.
testing again
off the jump, i hit errors too of 1. timeout -> 2. tx already in mempool -> 3. failed after 30 attempts -> 4. incorrect account sequence 💀
3:18PM INF indexed block events height=65 module=txindex
3:18PM INF Creating and publishing block height=66 module=BlockManager
3:18PM INF finalized block block_app_hash=AA4DE08E50D9CC5C02C63D84D6D36B2AA67E180AC1BAF98FBC96A32A44F1F901 height=66 module=BlockManager num_txs_res=0 num_val_updates=0
3:18PM INF executed block app_hash="�M��P��\\\x02�=���k*�~\x18\n�������*D��\x01" height=66 module=BlockManager
3:18PM INF indexed block events height=66 module=txindex
3:18PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = rpc error: code = Unknown desc = timed out waiting for tx to be included in a block" attempt=1 module=BlockManager
3:18PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=2 module=BlockManager
3:18PM INF Creating and publishing block height=67 module=BlockManager
3:18PM INF finalized block block_app_hash=37B23F80A55DC33C82102F484F2786C5D95327F55443CA2B7C5BB33371341968 height=67 module=BlockManager num_txs_res=0 num_val_updates=0
3:18PM INF executed block app_hash="7�?��]�<�\x10/HO'���S'�TC�+|[�3q4\x19h" height=67 module=BlockManager
3:18PM INF indexed block events height=67 module=txindex
3:18PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=3 module=BlockManager
[...]
3:24PM INF indexed block events height=431 module=txindex
3:24PM ERR error while submitting block to DA error="failed to submit block to DA layer after 30 attempts" module=BlockManager
3:24PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=1 module=BlockManager
update
a blob went through, 500+ blocks after it should have on the rollup
https://mocha.celenium.io/namespace/000000000000000000000000000000000000676f6f646d6f726e696e
the issue is that the cycle of 1. timeout -> 2. tx already in mempool -> 3. failed after 30 attempts -> 4. incorrect account sequence
starts again at this point. leading to a massive backlog of blocks piling up.
800+ blocks in, 15 blocks have been posted to DA
second blob has been posted 17 minutes later
446/1600 blocks have been posted to DA
I'd like to stop adding to this thread, but new issues keep popping up. At time of writing this, 2600 blocks had been created, but 3348 blobs have been posted to the namespace
there are now 2.5 times as many blobs posted to DA as there are rollup blocks that have been posted.
9143/3647 = ~2.50
there is no real way for me to know if this rollup has actually posted everything, or just posted the same blocks multiple times.
i'm now stopping the chain, because no new blobs have been posted, 3000 more blocks have been produced
It seems the source of the last issue (last 1, 2, 3 comments) where the same batch was posted more than once, is related to https://github.com/rollkit/rollkit/issues/1472
With #1535 we're handling common DA congestion errors like
- timed out waiting for tx to be included in a block
- incorrect account sequence
- insufficient fee
- tx already in mempool
- tx too big
Most of these errors originate from insufficient gas price. We're now adding an increased timeout and retrying the transaction based on the error message.
We can improve this by estimating gas price from third party providers after https://github.com/celestiaorg/celestia-app/issues/3114 is implemented.
Until then we support dynamic gas price retargeting by using the --gas-multiplier
flag, documented in https://github.com/rollkit/docs/pull/328
On celestia-node >= 0.13.7
- timed out waiting for tx to be included in a block
- handled by decreasing blob size
- incorrect account sequence
- no longer an issue
- insufficient fee
- can be configured with
--rollkit.da_gas_price
and--rollkit.da_gas_multiplier
- can be configured with
- tx already in mempool
- no longer an issue
- tx too big
- handled by decreasing blob size
Therefore I believe this can be closed as fixed.