
Healing flat-db takes node offline for an extended period (48h+)

Open p4fg opened this issue 1 year ago • 1 comment

Description

As a validator, I want it documented that flat database healing can take a very long time and incurs downtime on mainnet, so that I can switch beacon nodes before starting the heal.

Using 24.1.1, I enabled --Xsnapsync-synchronizer-flat-db-healing-enabled=true and followed the instructions in https://github.com/hyperledger/besu/pull/5319 .

I called the RPC endpoint on a fully synced Besu to trigger the healing (as per the instructions in the PR):

curl --location --request POST 'http://localhost:8545' \
--header 'Content-Type: application/json' \
--data-raw '{
    "jsonrpc": "2.0",
    "method": "debug_resyncWorldState",
    "params": [],
    "id": 1
}' 
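While the heal runs, one way to watch whether the node still reports itself as syncing is to poll the standard eth_syncing JSON-RPC method. A minimal sketch, assuming the same local endpoint as above:

```shell
# Poll eth_syncing: returns false once the node considers itself in sync,
# or a sync-status object while it is still healing/syncing.
PAYLOAD='{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
if curl -sf -m 2 http://localhost:8545 \
     -H 'Content-Type: application/json' \
     --data "$PAYLOAD"; then
  echo   # newline after the JSON response
else
  echo "node not reachable on :8545" >&2
fi
```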

This resulted in the node going offline/out of sync while healing, which was acceptable for a while. But after 24 hours the heal was still not done, and no ETA is presented in the logs:

Feb 08 08:01:07 validator besu[2940849]: 2024-02-08 08:01:07.057+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 337027245 world state trie nodes, Peer count: 25
Feb 08 08:02:07 validator besu[2940849]: 2024-02-08 08:02:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 337861735 world state trie nodes, Peer count: 25
Feb 08 08:03:07 validator besu[2940849]: 2024-02-08 08:03:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 338623741 world state trie nodes, Peer count: 25
Feb 08 08:04:07 validator besu[2940849]: 2024-02-08 08:04:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 339439837 world state trie nodes, Peer count: 25
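For reference, a rough heal rate can be derived from two of the counters above, though no ETA follows from it, since Besu does not log how many trie nodes remain. A sketch, with the counts and timestamps copied from the 08:01 and 08:04 lines:

```shell
# "Healed N world state trie nodes" counters at two timestamps, 3 minutes apart.
c0=337027245   # 08:01:07
c1=339439837   # 08:04:07
secs=180
rate=$(( (c1 - c0) / secs ))
echo "healing ~${rate} trie nodes/s"
```

At this rate the counter grows by roughly 800k nodes per minute, but without a total there is no way to tell how far along the heal is.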

As this is now the default behaviour, it should be documented that healing can take a long time and that downtime is expected, since it will likely kick in for some users (assuming this is the expected behaviour).

Acceptance Criteria

One or more of:

  • Keeping the node online while healing, enabling normal functions
  • Documenting that this will take the node offline for an extended time
  • Printing an ETA based on how many nodes are left to heal

Steps to Reproduce (Bug)

  1. Use fully synced Besu 24.1.1
  2. Trigger healing by RPC-call to debug_resyncWorldState
  3. Watch the beacon node and validators go offline while healing runs for a very long time

Expected behavior:

  • Node online while healing

Actual behavior:

  • Node offline / unable to validate

Frequency: N/A

Logs (if a bug)

Besu

Feb 08 08:05:07 validator besu[2940849]: 2024-02-08 08:05:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 340216439 world state trie nodes, Peer count: 25
Feb 08 08:06:07 validator besu[2940849]: 2024-02-08 08:06:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 341065952 world state trie nodes, Peer count: 25
Feb 08 08:07:07 validator besu[2940849]: 2024-02-08 08:07:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 341834726 world state trie nodes, Peer count: 25
Feb 08 08:08:07 validator besu[2940849]: 2024-02-08 08:08:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 342700881 world state trie nodes, Peer count: 25
Feb 08 08:09:07 validator besu[2940849]: 2024-02-08 08:09:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 343455446 world state trie nodes, Peer count: 25

Lighthouse:

Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 WARN Head is optimistic                      execution_block_hash: 0x712a937bd66dbb74e4c02fb96730a696fdc774dbfdbae31eb81ce905a90066ef, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 INFO Synced                                  slot: 8379645, block: 0x6208…772f, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0x712a…66ef (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 WARN Syncing deposit contract block cache    est_blocks_remaining: 9971, service: slot_notifier
Feb 08 08:09:36 validator lighthouse[2941077]: Feb 08 08:09:36.611 INFO New block received                      root: 0x7f87a6bb0d18bc933224038f6c6be42dda1cd1fc7426a1daeebe945d9be41038, slot: 8379646
Feb 08 08:09:38 validator lighthouse[2941077]: Feb 08 08:09:38.188 WARN Execution endpoint is not synced        last_seen_block_unix_timestamp: 0, endpoint: http://127.0.0.1:8551/, auth=true, service: deposit_contract_rpc
Feb 08 08:09:38 validator lighthouse[2941077]: Feb 08 08:09:38.188 ERRO Error updating deposit contract cache   error: Failed to get remote head and new block ranges: EndpointError(FarBehind), retry_millis: 60000, service: deposit_contract_rpc
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 WARN Head is optimistic                      execution_block_hash: 0xb7ba2bfdb522f9472542ecbbe1b643a3ae16b9b15be380bf76d6de2e1c3ff01d, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 INFO Synced                                  slot: 8379646, block: 0x7f87…1038, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0xb7ba…f01d (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 WARN Syncing deposit contract block cache    est_blocks_remaining: 9971, service: slot_notifier
Feb 08 08:09:48 validator lighthouse[2941077]: Feb 08 08:09:48.320 INFO New block received                      root: 0xc0109ff3a900218a7754dbda5512b71514a9050d1685b2b2e7c6c0fe4298a9e6, slot: 8379647
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.000 WARN Head is optimistic                      execution_block_hash: 0x43531a956d77b857df1a49500612cbdeeaa3f338947c0c4fd5cc7b2d31fefbfa, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.001 INFO Synced                                  slot: 8379647, block: 0xc010…a9e6, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0x4353…fbfa (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.001 WARN Syncing deposit contract block cache    est_blocks_remaining: 9971, service: slot_notifier

Versions (Add all that apply)

  • Software version: besu 24.1.1
  • Java version: openjdk version "17.0.9" 2023-10-17
  • OS Name & Version: Ubuntu 22.04.2 LTS
  • Kernel Version: 5.15.0-71-generic
  • Consensus Client & Version if using Proof of Stake: Lighthouse 4.6.0

Additional Information

  • Intel(R) Xeon(R) W-1290P, 128GB RAM, Samsung PM9A3 3.84TB NVMe

p4fg avatar Feb 08 '24 08:02 p4fg

Sometimes this debug_resyncWorldState call does not run correctly and triggers a full heal instead of a proper snap sync; it's an issue we need to fix. I think the best option is to resync from scratch with the flat healing flag; that should be fine. It's not an issue with the flat DB healing itself, but rather with debug_resyncWorldState.
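The suggested resync-from-scratch could look like the following sketch. The service name and data path are assumptions for a typical systemd setup (adjust for yours); with DRY_RUN=1 (the default here) the script only prints the commands instead of running them:

```shell
# Dry-run sketch of a resync from scratch with flat-db healing enabled.
DRY_RUN=${DRY_RUN:-1}
CMDS=""
run() {
  CMDS="${CMDS}$*; "                               # record for inspection
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

run systemctl stop besu
run rm -rf /var/lib/besu/database   # assumed data path; wipes the chain DB!
# Ensure --Xsnapsync-synchronizer-flat-db-healing-enabled=true is set in the
# Besu config/unit file before restarting, so the fresh sync builds a flat DB.
run systemctl start besu
```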

matkt avatar Feb 08 '24 17:02 matkt