Investigation of long async cycles during initialization on testworld-2-0
We saw long async cycles of ~15 mins on whale-1-1 of testworld-2-0 on 12/5/2023. The long async cycle happens during node initialization. Commit: 4403440f55
```
2023-12-05 19:36:25.123 Mina daemon is listening
2023-12-05 19:36:25.362 Already connected to enough peers, start initialization
2023-12-05 19:52:17.285 Cannot fast forward persistent frontier's root: bootstrap is required ($current_root --> $target_root)
2023-12-05 19:52:17.286 Bootstrap required
2023-12-05 19:52:17.286 Persisted frontier failed to load
2023-12-05 19:52:17.286 Persistent frontier dropped
2023-12-05 19:52:17.290 Fast forward has not been implemented. Bootstrapping instead.
2023-12-05 19:52:17.290 Long async cycle, $long_async_cycle seconds, $monitors, $o1trace
2023-12-05 19:52:17.290 Long async job, $long_async_job seconds, $monitors, $o1trace
2023-12-05 19:52:17.290 transaction_pool $rate_limiter
```
From the Grafana thread timing info, the long async cycle is caused by Transition_frontier.load. I believe most of the time is spent in Persistent_frontier.Instance.check_database.
After we deleted the config dir and restarted the node, the behavior went away. This bug does not appear to be related to #14617. I was not able to reproduce the behavior locally.
I think the persistent database could somehow be malformed, which would cause Persistent_frontier.Instance.check_database to take a long time to go through the database.
This bug also seems unrelated to the recent mask changes on the rampup branch.
Suggestion: add a timing log around the Persistent_frontier.Instance.check_database function, along the lines of the sketch below.
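A minimal sketch of what that timing log could look like, assuming a small helper can be dropped in where Transition_frontier.load invokes check_database. The with_timing helper, its call shape, and the stderr output are illustrative assumptions, not the daemon's actual logging API; a real patch would go through the daemon's structured logger.

```ocaml
(* Minimal sketch of the suggested timing log (names and call shape assumed).
   A real patch would use the daemon's structured logger instead of stderr. *)
let with_timing ~label f =
  let start = Unix.gettimeofday () in
  let result = f () in
  let elapsed = Unix.gettimeofday () -. start in
  (* Report how long the wrapped call took, in seconds. *)
  Printf.eprintf "%s took %.3f s\n%!" label elapsed ;
  result

(* Hypothetical usage at the suspected hot spot:
   with_timing ~label:"Persistent_frontier.Instance.check_database"
     (fun () -> Persistent_frontier.Instance.check_database instance)
*)
```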
Triage update: we need a new reproduction of this issue so that we have sufficient data to debug it properly.
In the initial investigation, it seemed this could be related to a potential leak in the rocksdb database. The issue can be reproduced against specific rocksdb instances by copying the database from an affected node.
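Until we have a fresh reproduction, one cheap sanity check on a copied database is its on-disk size: a frontier database far larger than expected would support the leak / malformed-database hypothesis. The sketch below uses only the OCaml standard library; the database path is an assumption and should point at wherever the copied rocksdb directory was placed.

```ocaml
(* Hedged sketch: report file count and total size of a copied rocksdb
   directory taken from an affected node. *)
let dir_stats dir =
  Array.fold_left
    (fun (files, bytes) name ->
      let path = Filename.concat dir name in
      let st = Unix.stat path in
      if st.Unix.st_kind = Unix.S_REG then (files + 1, bytes + st.Unix.st_size)
      else (files, bytes))
    (0, 0)
    (Sys.readdir dir)

let () =
  (* Assumed location of the copied frontier database. *)
  let dir = ".mina-config/frontier" in
  let files, bytes = dir_stats dir in
  Printf.printf "%s: %d files, %.1f MiB\n" dir files
    (float_of_int bytes /. (1024. *. 1024.))
```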