Have `--checkpoint-sync-url` timeout

Open OisinKyne opened this issue 2 years ago • 3 comments

Description

I have set up LH as a BN on goerli using the following settings:

lighthouse bn
      --network=goerli
      --purge-db
      --checkpoint-sync-url=https://goerli.checkpoint-sync.ethdevops.io
      --execution-endpoint=http://geth:8551
      --execution-jwt=/opt/jwt/jwt.hex
      --http
      --http-address=0.0.0.0
      --http-port=5052
      --metrics
      --metrics-address=0.0.0.0
      --metrics-port=5054
      --metrics-allow-origin="*"

However, it seems the ethdevops.io API may not be returning any data. As a result, LH hangs indefinitely waiting for a response. Preferably it would either retry, or give up and sync another way (e.g. fall back to a hardcoded weak subjectivity checkpoint within the codebase, or even a full sync).

Version

Please provide your Lighthouse and Rust version. Are you building from stable or unstable, which commit?

Tried on v2.5.1 and latest-unstable, both using docker images

Present Behaviour

Aug 17 10:39:03.534 INFO Logging to file                         path: "/root/.lighthouse/goerli/beacon/logs/beacon.log"
Aug 17 10:39:03.534 INFO Lighthouse started                      version: Lighthouse/v2.5.1-c2604c4
Aug 17 10:39:03.535 INFO Configured for network                  name: goerli
Aug 17 10:39:03.535 INFO Data directory initialised              datadir: /root/.lighthouse/goerli
Aug 17 10:39:03.535 INFO Deposit contract                        address: 0xff50ed3d0ec03ac01d4c79aad74928bff48a7b2b, deploy_block: 4367322
Aug 17 10:39:03.599 INFO Starting checkpoint sync                remote_url: https://goerli.checkpoint-sync.ethdevops.io/, service: beacon

Expected Behaviour

After maybe 10 seconds print a timeout log and sync a different way.
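
A minimal sketch of the behaviour requested above, using only the Rust standard library. It is not Lighthouse code: `run_with_timeout`, `Attempt`, `SyncMode` and `checkpoint_or_fallback` are hypothetical names, and the fetch closure stands in for whatever request downloads the checkpoint state.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Outcome of an operation that was given a deadline.
#[derive(Debug, PartialEq)]
pub enum Attempt<T> {
    Completed(T),
    TimedOut,
    /// The worker thread ended without sending a result (e.g. it panicked).
    WorkerFailed,
}

/// Runs `op` on a worker thread and waits at most `timeout` for its result.
/// On timeout the worker is left running detached; it is not cancelled.
pub fn run_with_timeout<T, F>(op: F, timeout: Duration) -> Attempt<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Attempt::Completed(value),
        Err(RecvTimeoutError::Timeout) => Attempt::TimedOut,
        Err(RecvTimeoutError::Disconnected) => Attempt::WorkerFailed,
    }
}

/// Hypothetical choice between checkpoint sync and some other sync path.
#[derive(Debug, PartialEq)]
pub enum SyncMode {
    Checkpoint(Vec<u8>),
    Fallback,
}

/// Tries the checkpoint fetch; logs and falls back if it does not finish in time.
pub fn checkpoint_or_fallback<F>(fetch: F, timeout: Duration) -> SyncMode
where
    F: FnOnce() -> Vec<u8> + Send + 'static,
{
    match run_with_timeout(fetch, timeout) {
        Attempt::Completed(state_bytes) => SyncMode::Checkpoint(state_bytes),
        other => {
            eprintln!("checkpoint sync did not complete ({:?}), falling back", other);
            SyncMode::Fallback
        }
    }
}
```

Calling `checkpoint_or_fallback(fetch, Duration::from_secs(10))` would give the "print a timeout log after about 10 seconds and sync a different way" behaviour described here, assuming a fallback path exists.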

Steps to resolve

Right now I had to remove the checkpoint sync entirely to get moving again.

OisinKyne avatar Aug 17 '22 10:08 OisinKyne

This is a good idea. The HTTP client for checkpoint sync sets a timeout here: https://github.com/sigp/lighthouse/blob/df51a73272489fe154bd10995c96199062b6c3f7/beacon_node/client/src/builder.rs#L276-L277

However, that timeout only applies to certain requests common to the validator client. I think we could add two new timeouts for `get_beacon_blocks_ssz` and `get_debug_beacon_states`, which are the two endpoints currently used by checkpoint sync:

https://github.com/sigp/lighthouse/blob/18c61a5e8be3e54226a86a69b96f8f4f7fd790e4/common/eth2/src/lib.rs#L107-L115

This would also require some refactoring of `get_bytes_opt_accept_header` to accept a timeout:

https://github.com/sigp/lighthouse/blob/18c61a5e8be3e54226a86a69b96f8f4f7fd790e4/common/eth2/src/lib.rs#L237-L242
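
As a rough illustration of the idea (not the code in the linked files), per-endpoint timeouts could be modelled as a small table consulted before each request. `EndpointTimeouts`, `Endpoint`, `set_all` and `for_endpoint` are hypothetical names chosen for this sketch; only the two endpoint names come from the comment above.

```rust
use std::time::Duration;

/// Endpoints that may need their own timeout (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Endpoint {
    BeaconBlocksSsz,
    DebugBeaconStates,
    Other,
}

/// Hypothetical timeout table: one entry per endpoint used by checkpoint
/// sync, plus a catch-all for every other request.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EndpointTimeouts {
    pub get_beacon_blocks_ssz: Duration,
    pub get_debug_beacon_states: Duration,
    pub other: Duration,
}

impl EndpointTimeouts {
    /// Uses the same timeout for every endpoint.
    pub fn set_all(timeout: Duration) -> Self {
        Self {
            get_beacon_blocks_ssz: timeout,
            get_debug_beacon_states: timeout,
            other: timeout,
        }
    }

    /// Returns the timeout a request to `endpoint` should be given.
    pub fn for_endpoint(&self, endpoint: Endpoint) -> Duration {
        match endpoint {
            Endpoint::BeaconBlocksSsz => self.get_beacon_blocks_ssz,
            Endpoint::DebugBeaconStates => self.get_debug_beacon_states,
            Endpoint::Other => self.other,
        }
    }
}
```

A byte-fetching helper that takes a timeout argument could then be passed `timeouts.for_endpoint(Endpoint::DebugBeaconStates)`, so that a large state download can be given a longer limit than a block request.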

michaelsproul avatar Aug 24 '22 01:08 michaelsproul

I am planning to work on this issue. If the timeout is triggered, the simplest action is to terminate the process. My question is: should the user be asked to choose another option, such as retrying or resuming default syncing?

MaboroshiChan avatar Aug 29 '22 09:08 MaboroshiChan

> My question is: should the user be asked to choose another option, such as retrying or resuming default syncing?

I think it's OK to just exit. Lots of users run Lighthouse under systemd, which will retry a couple of times by default. Users on the CLI can always retry manually.

michaelsproul avatar Aug 29 '22 09:08 michaelsproul

Resolved by #3521

michaelsproul avatar Sep 26 '22 05:09 michaelsproul