reth icon indicating copy to clipboard operation
reth copied to clipboard

Metrics enpoint blocks if reth gets stuck

Open erik78se opened this issue 1 year ago • 0 comments

Describe the bug

I was syncing reth. The service ran just fine, but somehow got "stuck".

Indication was that the log said "Checkpoint=0".

When reth was in that state - The "/metrics" endpoint was blocking. Unusable.

I would expect the metrics endpoint to "work" (is it running in the same thread as the main service?)

This is a conversation from telegram on the topic:

image

So, for a node operator, such as us at https://dwellir.com - to be able to monitor for instances finding themselves as "stuck" is monumental.

Would it make sense to break out the thread that manages "/metrics" such as we could discover situations where the reth process "works" but effectively is "stuck" ? This is pretty key for long term operations.

A restart of reth a few times seems to get the node back into business/syncing. But it required babysitting the sync process which isn't possible in a larger scale operations.

Steps to reproduce

  1. Install reth 1.0.3
  2. Install lighthouse
  3. Connect the instances.
  4. Start sync.
  5. Wait to get stuck (Checkpoint=0)
  6. Try calling the /metrics endpoint with curl.
  7. The call blocks. (Error reproduced)

Node logs

Possibly, this is where something goes sideways as I can see the checkpoint=0 coming in here.

024-07-31T17:27:41.068486Z  INFO reth_node_events::node: Preparing stage pipeline_stages=2/12 stage=Bodies checkpoint=20425000 target=20425694
2024-07-31T17:27:41.068545Z  INFO reth_node_events::node: Executing stage pipeline_stages=2/12 stage=Bodies checkpoint=20425000 target=20425694
2024-07-31T17:27:41.419545Z  INFO reth_node_events::node: Finished stage pipeline_stages=2/12 stage=Bodies checkpoint=20425694 target=20425694 stage_progress=100.00%
2024-07-31T17:27:41.434439Z DEBUG provider::static_file: Commit segment=Transactions path="/root/snap/reth/15/.local/share/reth/mainnet/static_files/static_file_transactions_20000000_20499999" duration=14.954467ms
2024-07-31T17:27:41.435961Z DEBUG storage::db::mdbx: Commit total_duration=1.408035ms commit_latency=Some(CommitLatency(MDBX_commit_latency { preparation: 0, gc_wallclock: 2, audit: 0, write: 0, sync: 87, ending: 0, whole: 90,
 gc_cputime: 0, gc_prof: MDBX_commit_latency__bindgen_ty_1 { wloops: 0, coalescences: 0, wipes: 0, flushes: 0, kicks: 0, work_counter: 0, work_rtime_monotonic: 0, work_xtime_cpu: 0, work_rsteps: 0, work_xpages: 0, work_majflt:
 0, self_counter: 0, self_rtime_monotonic: 0, self_xtime_cpu: 0, self_rsteps: 0, self_xpages: 0, self_majflt: 0 } })) is_read_only=false
2024-07-31T17:27:41.436044Z  INFO reth_node_events::node: Preparing stage pipeline_stages=3/12 stage=SenderRecovery checkpoint=0 target=20425694
2024-07-31T17:27:41.436055Z  INFO reth_node_events::node: Executing stage pipeline_stages=3/12 stage=SenderRecovery checkpoint=0 target=20425694
2024-07-31T17:27:41.437466Z  INFO sync::stages::sender_recovery: Recovering senders tx_range=0..5000003
2024-07-31T17:27:41.437470Z DEBUG sync::stages::sender_recovery: Recovering senders batch tx_range=0..100000
2024-07-31T17:27:41.437920Z DEBUG sync::stages::sender_recovery: Appending recovered senders to the database tx_range=0..100000
2024-07-31T17:27:42.199709Z DEBUG sync::stages::sender_recovery: Finished recovering senders batch tx_range=0..100000
2024-07-31T17:27:42.199724Z DEBUG sync::stages::sender_recovery: Recovering senders batch tx_range=100000..200000
2024-07-31T17:27:42.199975Z DEBUG sync::stages::sender_recovery: Appending recovered senders to the database tx_range=100000..200000
2024-07-31T17:27:42.961397Z DEBUG sync::stages::sender_recovery: Finished recovering senders batch tx_range=100000..200000
2024-07-31T17:27:42.961412Z DEBUG sync::stages::sender_recovery: Recovering senders batch tx_range=200000..300000
2024-07-31T17:27:42.961557Z DEBUG sync::stages::sender_recovery: Appending recovered senders to the database tx_range=200000..300000
2024-07-31T17:27:43.724572Z DEBUG sync::stages::sender_recovery: Finished recovering senders batch tx_range=200000..300000
2024-07-31T17:27:43.724587Z DEBUG sync::stages::sender_recovery: Recovering senders batch tx_range=300000..400000
2024-07-31T17:27:43.724714Z DEBUG sync::stages::sender_recovery: Appending recovered senders to the database tx_range=300000..400000
2024-07-31T17:27:43.757039Z  INFO reth::cli: Status connected_peers=74 freelist=15 stage=SenderRecovery checkpoint=0 target=20425694
2024-07-31T17:27:44.470751Z DEBUG sync::stages::sender_recovery: Finished recovering senders batch tx_range=300000..400000
2024-07-31T17:27:44.470767Z DEBUG sync::stages::sender_recovery: Recovering senders batch tx_range=400000..500000
2024-07-31T17:27:44.470927Z DEBUG sync::stages::sender_recovery: Appending recovered senders to the database tx_range=400000..500000
2024-07-31T17:27:45.213033Z DEBUG sync::stages::sender_recovery: Finished recovering senders batch tx_range=400000..500000
2024-07-31T17:27:45.213048Z DEBUG sync::stages::sender_recovery: Recovering senders batch tx_range=500000..600000
2024-07-31T17:27:45.213171Z DEBUG sync::stages::sender_recovery: Appending recovered senders to the database tx_range=500000..600000
2024-07-31T17:27:45.955626Z DEBUG sync::stages::sender_recovery: Finished recovering senders batch tx_range=500000..600000

... and a bit later:

2024-07-31T22:39:30.580987Z DEBUG sync::stages::sender_recovery: Finished recovering senders batch tx_range=2458948819..2458980010
2024-07-31T22:39:30.581143Z  INFO reth_node_events::node: Finished stage pipeline_stages=3/12 stage=SenderRecovery checkpoint=20425694 target=20425694 stage_progress=100.00% stage_eta=unknown
2024-07-31T22:39:30.741552Z DEBUG storage::db::mdbx: Commit total_duration=160.453687ms commit_latency=Some(CommitLatency(MDBX_commit_latency { preparation: 0, gc_wallclock: 2, audit: 0, write: 0, sync: 10496, ending: 0, whole
: 10498, gc_cputime: 0, gc_prof: MDBX_commit_latency__bindgen_ty_1 { wloops: 0, coalescences: 0, wipes: 0, flushes: 0, kicks: 0, work_counter: 0, work_rtime_monotonic: 0, work_xtime_cpu: 0, work_rsteps: 0, work_xpages: 0, work
_majflt: 0, self_counter: 0, self_rtime_monotonic: 0, self_xtime_cpu: 0, self_rsteps: 0, self_xpages: 0, self_majflt: 0 } })) is_read_only=false
2024-07-31T22:39:30.741675Z  INFO reth_node_events::node: Preparing stage pipeline_stages=4/12 stage=Execution checkpoint=0 target=20425694
2024-07-31T22:39:30.741931Z  INFO reth_node_events::node: Executing stage pipeline_stages=4/12 stage=Execution checkpoint=0 target=20425694
2024-07-31T22:39:30.747958Z DEBUG sync::stages::execution: Preparing static file producer start=1
2024-07-31T22:39:30.762505Z DEBUG provider::static_file: Commit segment=Receipts path="/root/snap/reth/15/.local/share/reth/mainnet/static_files/static_file_receipts_0_499999" duration=966.595µs
2024-07-31T22:39:30.765979Z DEBUG sync::stages::execution: Calculating gas used from headers range=0..=0
2024-07-31T22:39:30.769162Z DEBUG sync::stages::execution: Finished calculating gas used from headers range=0..=0 duration=14.802µs
2024-07-31T22:39:30.769166Z DEBUG sync::stages::execution: Calculating gas used from headers range=1..=20425694
2024-07-31T22:39:37.324614Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-07-31T22:39:48.314808Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-07-31T22:39:48.819466Z  INFO reth::cli: Status connected_peers=100 freelist=8 stage=Execution checkpoint=0 target=20425694
2024-07-31T22:40:01.092377Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-07-31T22:40:12.617997Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-07-31T22:40:13.819023Z  INFO reth::cli: Status connected_peers=100 freelist=8 stage=Execution checkpoint=0 target=20425694
2024-07-31T22:40:22.521493Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-07-31T22:40:22.521548Z  INFO reth_node_events::node: Received forkchoice updated message when syncing head_block_hash=0x0a50b9d682a73afc17c2f7f2d3059bbb5dd5c8d94efa133955b366dd9a29a844 safe_block_hash=0x4028230abf10b1e44d628a2120a2240f60315d3cdc3d7b9e1e31f1a5934b8da6 finalized_block_hash=0x44f38f31819708389bf4455f6c8e5433e093cadb8d2bbaf86024c9eff0f9f7a3

Platform(s)

Linux (x86)

What version/commit are you on?

reth Version: 1.0.3 Commit SHA: 390f30aadebcdd509e72cc04327c3b854de076a6 Build Timestamp: 2024-07-18T22:01:22.170093482Z Build Features: jemalloc Build Profile: maxperf

What database version are you on?

reth db version 2024-08-01T11:12:59.670173Z INFO Initialized tracing, debug log directory: /root/snap/reth/15/.cache/reth/logs/mainnet Current database version: 2 Local database version: 2

Which chain / network are you on?

mainnet

What type of node are you running?

Archive (default)

What prune config do you use, if any?

Cant find that section. [prune]

If you've built Reth from source, provide the full command you used

CARGO_TARGET_DIR=target cargo build --bin reth --profile maxperf --features jemalloc

Code of Conduct

  • [X] I agree to follow the Code of Conduct

erik78se avatar Aug 01 '24 11:08 erik78se