besu
Healing flat-db takes node offline for an extended period (48h+)
Description
As a validator, I want it documented that flat database healing takes a very long time and incurs downtime on mainnet, so that I can switch my beacon node before starting healing.
Using 24.1.1, I enabled --Xsnapsync-synchronizer-flat-db-healing-enabled=true
and followed the instructions in https://github.com/hyperledger/besu/pull/5319
.
I called the RPC endpoint on a fully synced Besu node to trigger the healing (as per the instructions in the PR):
curl --location --request POST 'http://localhost:8545' \
--header 'Content-Type: application/json' \
--data-raw '{
"jsonrpc": "2.0",
"method": "debug_resyncWorldState",
"params": [],
"id": 1
}'
This caused the node to go offline/out of sync while healing, which was fine for a while. But after 24 hours of healing it was still not done, and no ETA is presented in the logs:
Feb 08 08:01:07 validator besu[2940849]: 2024-02-08 08:01:07.057+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 337027245 world state trie nodes, Peer count: 25
Feb 08 08:02:07 validator besu[2940849]: 2024-02-08 08:02:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 337861735 world state trie nodes, Peer count: 25
Feb 08 08:03:07 validator besu[2940849]: 2024-02-08 08:03:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 338623741 world state trie nodes, Peer count: 25
Feb 08 08:04:07 validator besu[2940849]: 2024-02-08 08:04:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 339439837 world state trie nodes, Peer count: 25
As this is now the default behaviour, it should probably be documented that healing can take a long time and that downtime is expected, since this will likely kick in for some users (assuming this is the expected behaviour).
Acceptance Criteria
One or more of:
- Keep the node online while healing, so normal functions remain available
- Document that healing takes the node offline for an extended time
- Print an ETA or the number of nodes left to heal
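For context on the ETA point: the counters in the log excerpt already allow a rough healing-rate estimate (though not a true ETA, since the total number of trie nodes to heal is unknown). A minimal sketch using two samples from the logs above:

```python
from datetime import datetime

def heal_rate(t0, n0, t1, n1):
    """Nodes healed per second between two SnapsyncMetricsManager log samples."""
    return (n1 - n0) / (t1 - t0).total_seconds()

# Counter values taken from the log excerpt in this report.
rate = heal_rate(
    datetime(2024, 2, 8, 8, 1, 7), 337_027_245,
    datetime(2024, 2, 8, 8, 9, 7), 343_455_446,
)
print(round(rate))  # nodes healed per second over this 8-minute window
```

If Besu logged the remaining node count alongside the healed count, the same arithmetic would yield an actual ETA.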
Steps to Reproduce (Bug)
- Use fully synced Besu 24.1.1
- Trigger healing via an RPC call to debug_resyncWorldState
- Watch the beacon node and validators go offline while healing runs for a very long time
Expected behavior:
- Node online while healing
Actual behavior:
- Node offline / unable to validate
Frequency: N/A
Logs (if a bug)
Besu
Feb 08 08:05:07 validator besu[2940849]: 2024-02-08 08:05:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 340216439 world state trie nodes, Peer count: 25
Feb 08 08:06:07 validator besu[2940849]: 2024-02-08 08:06:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 341065952 world state trie nodes, Peer count: 25
Feb 08 08:07:07 validator besu[2940849]: 2024-02-08 08:07:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 341834726 world state trie nodes, Peer count: 25
Feb 08 08:08:07 validator besu[2940849]: 2024-02-08 08:08:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 342700881 world state trie nodes, Peer count: 25
Feb 08 08:09:07 validator besu[2940849]: 2024-02-08 08:09:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 343455446 world state trie nodes, Peer count: 25
Lighthouse:
Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 WARN Head is optimistic execution_block_hash: 0x712a937bd66dbb74e4c02fb96730a696fdc774dbfdbae31eb81ce905a90066ef, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 INFO Synced slot: 8379645, block: 0x6208…772f, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0x712a…66ef (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 WARN Syncing deposit contract block cache est_blocks_remaining: 9971, service: slot_notifier
Feb 08 08:09:36 validator lighthouse[2941077]: Feb 08 08:09:36.611 INFO New block received root: 0x7f87a6bb0d18bc933224038f6c6be42dda1cd1fc7426a1daeebe945d9be41038, slot: 8379646
Feb 08 08:09:38 validator lighthouse[2941077]: Feb 08 08:09:38.188 WARN Execution endpoint is not synced last_seen_block_unix_timestamp: 0, endpoint: http://127.0.0.1:8551/, auth=true, service: deposit_contract_rpc
Feb 08 08:09:38 validator lighthouse[2941077]: Feb 08 08:09:38.188 ERRO Error updating deposit contract cache error: Failed to get remote head and new block ranges: EndpointError(FarBehind), retry_millis: 60000, service: deposit_contract_rpc
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 WARN Head is optimistic execution_block_hash: 0xb7ba2bfdb522f9472542ecbbe1b643a3ae16b9b15be380bf76d6de2e1c3ff01d, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 INFO Synced slot: 8379646, block: 0x7f87…1038, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0xb7ba…f01d (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 WARN Syncing deposit contract block cache est_blocks_remaining: 9971, service: slot_notifier
Feb 08 08:09:48 validator lighthouse[2941077]: Feb 08 08:09:48.320 INFO New block received root: 0xc0109ff3a900218a7754dbda5512b71514a9050d1685b2b2e7c6c0fe4298a9e6, slot: 8379647
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.000 WARN Head is optimistic execution_block_hash: 0x43531a956d77b857df1a49500612cbdeeaa3f338947c0c4fd5cc7b2d31fefbfa, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.001 INFO Synced slot: 8379647, block: 0xc010…a9e6, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0x4353…fbfa (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.001 WARN Syncing deposit contract block cache est_blocks_remaining: 9971, service: slot_notifier
Versions (Add all that apply)
- Software version: besu 24.1.1
- Java version: openjdk version "17.0.9" 2023-10-17
- OS Name & Version: Ubuntu 22.04.2 LTS
- Kernel Version: 5.15.0-71-generic
- Consensus Client & Version if using Proof of Stake: Lighthouse 4.6.0
Additional Information
- Intel(R) Xeon(R) W-1290P, 128GB RAM, Samsung PM9A3 3.84TB NVMe
Sometimes this debug_resyncWorldState call does not run correctly and triggers a full heal instead of a proper snap sync. It's an issue we need to fix. I think the best option is to resync from scratch with the flat healing flag; that should work. It's not an issue with the flat DB healing itself, but with debug_resyncWorldState.
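The suggested workaround (resync from scratch with the flat-healing flag) would look roughly like the following. This is an ops sketch only: the service name and data path are illustrative assumptions, not taken from the issue.

```shell
# Illustrative sketch: service name and data path are assumptions.
sudo systemctl stop besu

# Remove the database under the Besu data path so the node performs a
# fresh snap sync (the node key outside this directory is preserved).
rm -rf /var/lib/besu/database

# Restart with flat-db healing enabled, so the fresh sync builds a
# healed flat database. The unit's ExecStart must include:
#   --Xsnapsync-synchronizer-flat-db-healing-enabled=true
sudo systemctl start besu
```

Note that a from-scratch resync also incurs downtime, so switching the beacon node to another execution endpoint beforehand still applies.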