
Investigation of long async cycles during initialization on testworld-2-0

Open · ghost-not-in-the-shell opened this issue 1 year ago · 2 comments

We saw long async cycles of ~15 mins on whale-1-1 of testworld-2-0 on 12/5/2023. The long async cycles happen during node initialization. Commit: 4403440f55

2023-12-05 19:36:25.123 Mina daemon is listening
2023-12-05 19:36:25.362 Already connected to enough peers, start initialization
2023-12-05 19:52:17.285 Cannot fast forward persistent frontier's root: bootstrap is required ($current_root --> $target_root)
2023-12-05 19:52:17.286 Bootstrap required
2023-12-05 19:52:17.286 Persisted frontier failed to load
2023-12-05 19:52:17.286 Persistent frontier dropped
2023-12-05 19:52:17.290 Fast forward has not been implemented. Bootstrapping instead.
2023-12-05 19:52:17.290 Long async cycle, $long_async_cycle seconds, $monitors, $o1trace
2023-12-05 19:52:17.290 Long async job, $long_async_job seconds, $monitors, $o1trace
2023-12-05 19:52:17.290 transaction_pool $rate_limiter

From the Grafana thread-timing info, the long async cycle is caused by Transition_frontier.load.

I believe most of the time is spent in the function Persistent_frontier.Instance.check_database.

After we delete the config dir and restart the node, the behavior goes away. This bug does not seem to be related to #14617. I was not able to reproduce the behavior locally.

I think the persistent database may somehow be malformed, which would cause Persistent_frontier.Instance.check_database to take a long time to scan the database.

This bug also seems to be unrelated to the recent mask changes on the rampup branch.

Suggestion: add a timing log to the Persistent_frontier.Instance.check_database function.
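A minimal sketch of what such a timing log could look like, assuming a plain Unix.gettimeofday wrapper and printf output rather than Mina's structured logger; the call site inside Transition_frontier.load and the check_database invocation shown in the comment are hypothetical usage, not actual Mina code.

```ocaml
(* Minimal sketch: time an arbitrary function and log the elapsed wall-clock
   time. Uses Unix.gettimeofday and printf instead of Mina's structured
   logger, purely for illustration. *)
let time_call ~label f =
  let start = Unix.gettimeofday () in
  let result = f () in
  let elapsed = Unix.gettimeofday () -. start in
  Printf.printf "%s took %.3f s\n%!" label elapsed ;
  result

(* Hypothetical call site (names assumed):
   let db_check =
     time_call ~label:"Persistent_frontier.Instance.check_database" (fun () ->
         Persistent_frontier.Instance.check_database instance)
*)
```

Having the elapsed time logged per call would confirm whether the long async cycle is indeed dominated by the database check, without needing Grafana access.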

ghost-not-in-the-shell avatar Dec 06 '23 15:12 ghost-not-in-the-shell

Triage update: we need a new reproduction of this issue so that we have sufficient data to debug it properly.

amc-ie avatar Dec 12 '23 20:12 amc-ie

In the initial investigation, this appeared to be related to a potential leak in the RocksDB database. The issue can be reproduced against specific RocksDB instances by copying the database from an affected node.

nholland94 avatar Apr 10 '24 19:04 nholland94