
Healing flat-db takes node offline for an extended period (48h+)

Open p4fg opened this issue 1 year ago • 1 comment

Description

As a validator, I want it documented that flat database healing can take a very long time and incurs downtime on mainnet, so that I can switch beacon nodes before starting the heal.

Using 24.1.1, I enabled --Xsnapsync-synchronizer-flat-db-healing-enabled=true and followed the instructions in https://github.com/hyperledger/besu/pull/5319 .

I called the RPC endpoint on a fully synced Besu to trigger the healing (as per the instructions in the PR):

curl --location --request POST 'http://localhost:8545' \
--header 'Content-Type: application/json' \
--data-raw '{
    "jsonrpc": "2.0",
    "method": "debug_resyncWorldState",
    "params": [],
    "id": 1
}' 
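While the heal runs, one way to watch whether the node still reports itself as syncing is to poll the standard eth_syncing JSON-RPC method. A minimal sketch, assuming the same local endpoint as above:

```shell
# Poll eth_syncing: returns false once the node considers itself in sync,
# or a sync-status object while it is still healing/syncing.
PAYLOAD='{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
if curl -sf -m 2 http://localhost:8545 \
     -H 'Content-Type: application/json' \
     --data "$PAYLOAD"; then
  echo   # newline after the JSON response
else
  echo "node not reachable on :8545" >&2
fi
```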

This resulted in the node going offline/out of sync while healing, which was acceptable for a while. But after 24 hours the heal was still not done, and no ETA is presented in the logs:

Feb 08 08:01:07 validator besu[2940849]: 2024-02-08 08:01:07.057+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 337027245 world state trie nodes, Peer count: 25
Feb 08 08:02:07 validator besu[2940849]: 2024-02-08 08:02:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 337861735 world state trie nodes, Peer count: 25
Feb 08 08:03:07 validator besu[2940849]: 2024-02-08 08:03:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 338623741 world state trie nodes, Peer count: 25
Feb 08 08:04:07 validator besu[2940849]: 2024-02-08 08:04:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 339439837 world state trie nodes, Peer count: 25
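For reference, a rough heal rate can be derived from two of the counters above, though no ETA follows from it, since Besu does not log how many trie nodes remain. A sketch, with the counts and timestamps copied from the 08:01 and 08:04 lines:

```shell
# "Healed N world state trie nodes" counters at two timestamps, 3 minutes apart.
c0=337027245   # 08:01:07
c1=339439837   # 08:04:07
secs=180
rate=$(( (c1 - c0) / secs ))
echo "healing ~${rate} trie nodes/s"
```

At this rate the counter grows by roughly 800k nodes per minute, but without a total there is no way to tell how far along the heal is.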

As this is now the default behaviour, it should be documented that healing can take a long time and that downtime is expected, since it will likely kick in for some users (assuming this is the expected behaviour).

Acceptance Criteria

One or more of:

  • Keeping the node online while healing, enabling normal functions
  • Documenting that this will take the node offline for an extended time
  • Printing an ETA based on how many nodes are left to heal

Steps to Reproduce (Bug)

  1. Use fully synced Besu 24.1.1
  2. Trigger healing by RPC-call to debug_resyncWorldState
  3. Watch the beacon node and validators go offline while healing runs for a very long time

Expected behavior:

  • Node online while healing

Actual behavior:

  • Node offline / unable to validate

Frequency: N/A

Logs (if a bug)

Besu

Feb 08 08:05:07 validator besu[2940849]: 2024-02-08 08:05:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 340216439 world state trie nodes, Peer count: 25
Feb 08 08:06:07 validator besu[2940849]: 2024-02-08 08:06:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 341065952 world state trie nodes, Peer count: 25
Feb 08 08:07:07 validator besu[2940849]: 2024-02-08 08:07:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 341834726 world state trie nodes, Peer count: 25
Feb 08 08:08:07 validator besu[2940849]: 2024-02-08 08:08:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 342700881 world state trie nodes, Peer count: 25
Feb 08 08:09:07 validator besu[2940849]: 2024-02-08 08:09:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 343455446 world state trie nodes, Peer count: 25

Lighthouse:

Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 WARN Head is optimistic                      execution_block_hash: 0x712a937bd66dbb74e4c02fb96730a696fdc774dbfdbae31eb81ce905a90066ef, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 INFO Synced                                  slot: 8379645, block: 0x6208…772f, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0x712a…66ef (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 WARN Syncing deposit contract block cache    est_blocks_remaining: 9971, service: slot_notifier
Feb 08 08:09:36 validator lighthouse[2941077]: Feb 08 08:09:36.611 INFO New block received                      root: 0x7f87a6bb0d18bc933224038f6c6be42dda1cd1fc7426a1daeebe945d9be41038, slot: 8379646
Feb 08 08:09:38 validator lighthouse[2941077]: Feb 08 08:09:38.188 WARN Execution endpoint is not synced        last_seen_block_unix_timestamp: 0, endpoint: http://127.0.0.1:8551/, auth=true, service: deposit_contract_rpc
Feb 08 08:09:38 validator lighthouse[2941077]: Feb 08 08:09:38.188 ERRO Error updating deposit contract cache   error: Failed to get remote head and new block ranges: EndpointError(FarBehind), retry_millis: 60000, service: deposit_contract_rpc
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 WARN Head is optimistic                      execution_block_hash: 0xb7ba2bfdb522f9472542ecbbe1b643a3ae16b9b15be380bf76d6de2e1c3ff01d, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 INFO Synced                                  slot: 8379646, block: 0x7f87…1038, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0xb7ba…f01d (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 WARN Syncing deposit contract block cache    est_blocks_remaining: 9971, service: slot_notifier
Feb 08 08:09:48 validator lighthouse[2941077]: Feb 08 08:09:48.320 INFO New block received                      root: 0xc0109ff3a900218a7754dbda5512b71514a9050d1685b2b2e7c6c0fe4298a9e6, slot: 8379647
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.000 WARN Head is optimistic                      execution_block_hash: 0x43531a956d77b857df1a49500612cbdeeaa3f338947c0c4fd5cc7b2d31fefbfa, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.001 INFO Synced                                  slot: 8379647, block: 0xc010…a9e6, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0x4353…fbfa (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.001 WARN Syncing deposit contract block cache    est_blocks_remaining: 9971, service: slot_notifier

Versions (Add all that apply)

  • Software version: besu 24.1.1
  • Java version: openjdk version "17.0.9" 2023-10-17
  • OS Name & Version: Ubuntu 22.04.2 LTS
  • Kernel Version: 5.15.0-71-generic
  • Consensus Client & Version if using Proof of Stake: Lighthouse 4.6.0

Additional Information

  • Intel(R) Xeon(R) W-1290P, 128GB RAM, Samsung PM9A3 3.84TB NVMe

p4fg avatar Feb 08 '24 08:02 p4fg

Sometimes this debug_resyncWorldState call does not run correctly and triggers a full heal instead of a proper snap sync; it's an issue we need to fix. I think the best option is to resync from scratch with the flat healing flag; that should be fine. It's not an issue with the flat DB healing itself, but rather with debug_resyncWorldState.
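The suggested resync-from-scratch could look like the following sketch. The service name and data path are assumptions for a typical systemd setup (adjust for yours); with DRY_RUN=1 (the default here) the script only prints the commands instead of running them:

```shell
# Dry-run sketch of a resync from scratch with flat-db healing enabled.
DRY_RUN=${DRY_RUN:-1}
CMDS=""
run() {
  CMDS="${CMDS}$*; "                               # record for inspection
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

run systemctl stop besu
run rm -rf /var/lib/besu/database   # assumed data path; wipes the chain DB!
# Ensure --Xsnapsync-synchronizer-flat-db-healing-enabled=true is set in the
# Besu config/unit file before restarting, so the fresh sync builds a flat DB.
run systemctl start besu
```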

matkt avatar Feb 08 '24 17:02 matkt