lighthouse
Have `--checkpoint-sync-url` timeout
Description
I have set up LH as a BN on Goerli using the following settings:
lighthouse bn
--network=goerli
--purge-db
--checkpoint-sync-url=https://goerli.checkpoint-sync.ethdevops.io
--execution-endpoint=http://geth:8551
--execution-jwt=/opt/jwt/jwt.hex
--http
--http-address=0.0.0.0
--http-port=5052
--metrics
--metrics-address=0.0.0.0
--metrics-port=5054
--metrics-allow-origin="*"
However, it seems there might be an issue with the ethdevops.io API not returning any data. As a result, LH just hangs indefinitely waiting for a response. Preferably it would either retry or give up and sync another way (e.g. fall back to a hardcoded weak subjectivity checkpoint within the codebase, or even a full sync).
Version
Please provide your Lighthouse and Rust version. Are you building from `stable` or `unstable`, which commit?
Tried on v2.5.1 and latest-unstable, both using docker images
Present Behaviour
Aug 17 10:39:03.534 INFO Logging to file path: "/root/.lighthouse/goerli/beacon/logs/beacon.log"
Aug 17 10:39:03.534 INFO Lighthouse started version: Lighthouse/v2.5.1-c2604c4
Aug 17 10:39:03.535 INFO Configured for network name: goerli
Aug 17 10:39:03.535 INFO Data directory initialised datadir: /root/.lighthouse/goerli
Aug 17 10:39:03.535 INFO Deposit contract address: 0xff50ed3d0ec03ac01d4c79aad74928bff48a7b2b, deploy_block: 4367322
Aug 17 10:39:03.599 INFO Starting checkpoint sync remote_url: https://goerli.checkpoint-sync.ethdevops.io/, service: beacon
Expected Behaviour
After maybe 10 seconds, print a timeout log and sync a different way.
Steps to resolve
Right now I had to remove the checkpoint sync entirely to get moving again.
This is a good idea. The HTTP client for checkpoint sync sets a timeout here: https://github.com/sigp/lighthouse/blob/df51a73272489fe154bd10995c96199062b6c3f7/beacon_node/client/src/builder.rs#L276-L277
However, that timeout only applies to certain requests common to the validator client. I think we could add two new timeouts for `get_beacon_blocks_ssz` and `get_debug_beacon_states`, which are the two endpoints currently used by checkpoint sync:
https://github.com/sigp/lighthouse/blob/18c61a5e8be3e54226a86a69b96f8f4f7fd790e4/common/eth2/src/lib.rs#L107-L115
This would also require some refactoring of `get_bytes_opt_accept_header` to accept a timeout:
https://github.com/sigp/lighthouse/blob/18c61a5e8be3e54226a86a69b96f8f4f7fd790e4/common/eth2/src/lib.rs#L237-L242
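To make the suggestion above concrete, here is a minimal sketch of what a per-endpoint timeout table could look like. It is not Lighthouse's actual code: the `Endpoint` enum, the `Timeouts` struct as written here, and the `timeout_for` helper are hypothetical names introduced for illustration, and it uses only the standard library rather than the real HTTP client.

```rust
use std::time::Duration;

/// Hypothetical endpoint tag; only the two checkpoint-sync calls are
/// distinguished, everything else is lumped into `Other`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Endpoint {
    BeaconBlocksSsz,
    DebugBeaconStates,
    Other,
}

/// Hypothetical per-endpoint timeout table (an assumption for this sketch,
/// not the struct defined in the linked source).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Timeouts {
    pub get_beacon_blocks_ssz: Duration,
    pub get_debug_beacon_states: Duration,
}

impl Timeouts {
    /// Use the same timeout for every endpoint in the table.
    pub fn set_all(timeout: Duration) -> Self {
        Timeouts {
            get_beacon_blocks_ssz: timeout,
            get_debug_beacon_states: timeout,
        }
    }
}

/// Pick the timeout to pass down to the byte-fetching helper.
/// `None` means "no specific timeout; use the client's default".
pub fn timeout_for(timeouts: &Timeouts, endpoint: Endpoint) -> Option<Duration> {
    match endpoint {
        Endpoint::BeaconBlocksSsz => Some(timeouts.get_beacon_blocks_ssz),
        Endpoint::DebugBeaconStates => Some(timeouts.get_debug_beacon_states),
        Endpoint::Other => None,
    }
}
```

With something of this shape, the refactored fetch helper would take the `Option<Duration>` as an extra argument and apply it to the individual request, so the state download can be given a longer budget than the block download.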
I am planning to work on this issue. If the timeout is triggered, the simplest action is to terminate the process. My question is: should the user be asked to choose another option, such as retrying or resuming default syncing?
> My question is: should the user be asked to choose another option, such as retrying or resuming default syncing?
I think it's OK to just exit. Lots of users run Lighthouse under systemd, which will retry a couple of times by default. Users on the CLI can always retry manually.
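As a rough illustration of the "time out, then just exit" behaviour, the sketch below bounds a blocking operation with a deadline and reports an error the caller can turn into a non-zero exit. It is an assumption-laden example, not the change that was merged: `run_with_timeout` is a hypothetical helper, and it uses a standard-library thread and channel in place of the real async HTTP client.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: run `work` on a separate thread and wait at most
/// `timeout` for its result. On timeout the worker is left detached and an
/// error string is returned so the caller can log it and exit.
pub fn run_with_timeout<T, F>(timeout: Duration, work: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if we timed out; ignore that error.
        let _ = tx.send(work());
    });
    rx.recv_timeout(timeout)
        .map_err(|e| format!("checkpoint sync request failed: {}", e))
}
```

A caller would match on the result, print the error, and call `std::process::exit(1)` on `Err`, leaving any retry policy to the process supervisor or to the user.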
Resolved by #3521