electrs
Bug: Crash on start if it can't connect to bitcoind
Describe the bug
The core question is: should a daemon like electrs crash on start if it can't connect to bitcoind?
```
Starting electrs 0.10.1 on x86_64 linux with Config { network: Regtest, db_path: "/build/devimint-7736-767/electrs/regtest", daemon_dir: "/build/devimint-7736-767/bitcoin/regtest", daemon_auth: UserPass("bitcoin", "<sensitive>"), daemon_rpc_addr: 127.0.0.1:17057, daemon_p2p_addr: 127.0.0.1:26493, electrum_rpc_addr: 127.0.0.1:23330, monitoring_addr: 127.0.0.1:24286, wait_duration: 10s, jsonrpc_timeout: 15s, index_batch_size: 10, index_lookup_limit: None, reindex_last_blocks: 0, auto_reindex: true, ignore_mempool: false, sync_once: false, skip_block_download_wait: false, disable_electrum_rpc: false, server_banner: "Welcome to electrs 0.10.1 (Electrum Rust Server)!", signet_magic: fabfb5da, args: [] }
[2024-03-18T06:20:13.921Z INFO  electrs::metrics::metrics_impl] serving Prometheus metrics on 127.0.0.1:24286
[2024-03-18T06:20:13.921Z INFO  electrs::server] serving Electrum RPC on 127.0.0.1:23330
[2024-03-18T06:20:13.942Z INFO  electrs::db] "/build/devimint-7736-767/electrs/regtest": 0 SST files, 0 GB, 0 Grows
[2024-03-18T06:20:13.943Z INFO  electrs::db] closing DB at /build/devimint-7736-767/electrs/regtest
Error: electrs failed

Caused by:
    0: bitcoind RPC polling failed
    1: daemon not available
    2: JSON-RPC error: transport error: Couldn't connect to host: Connection refused (os error 111)
```
Note the first timestamp: 20:13.921
The whole test suite started at:

```
2024-03-18T06:20:13.911817Z  INFO devimint: Setting up test dir path=/build/devimint-7736-767
```

timestamp: 20:13.911
bitcoind was spawned in the background earlier, and was available for querying only a few seconds later. But 30ms into the test suite, electrs had already given up on it.
It seems like all the Bitcoin daemons we're using behave like this: lightningd, lnd, electrs. Which makes me wonder: is this some shared design decision that I never learned about, or just a weird coincidence? :D All three are in different languages, by different teams, etc.
Sure, in a real deployment there will always be some kind of supervisor to restart things, but still... I would expect daemons to never shut down just because they can't connect to another networked service. What's the point, if the supervisor is just going to start them again?
The context is: I'm trying to optimize our test suite's startup time by letting more things start in parallel, etc. It would be nice if I could start some daemons around the same time I'm starting bitcoind, and not have to postpone everything until bitcoind takes a shower, brushes teeth, eats breakfast and is finally ready for work.
Can I work on this if the issue hasn't been solved yet?

Would it solve the issue to check whether a daemon like bitcoind is running every few seconds (as set in the config file), logging an ERROR on each failed retry and an INFO on each successful connection?
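A retry-on-start loop along those lines could look like the following sketch. This is purely illustrative, not electrs's actual code: `wait_for_bitcoind`, the address, the retry interval, and the attempt limit are all hypothetical names/parameters that would presumably come from the config.

```rust
use std::net::TcpStream;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper (not part of electrs): keep trying to reach
/// bitcoind's RPC port, logging ERROR on each failed attempt and INFO
/// on success, giving up after `max_attempts` tries.
fn wait_for_bitcoind(addr: &str, retry_interval: Duration, max_attempts: u32) -> Option<TcpStream> {
    for attempt in 1..=max_attempts {
        match TcpStream::connect(addr) {
            Ok(stream) => {
                eprintln!("INFO: connected to bitcoind at {addr} (attempt {attempt})");
                return Some(stream);
            }
            Err(e) => {
                eprintln!("ERROR: bitcoind not reachable at {addr}: {e} (attempt {attempt}/{max_attempts})");
                sleep(retry_interval);
            }
        }
    }
    None // caller decides whether to abort or keep waiting
}
```

With `max_attempts` set very high (or the loop made unbounded), this degrades gracefully into "wait forever for bitcoind", which is what a supervisor-managed deployment would effectively do anyway.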
There's some design debate to be had here, I guess. Does electrs really need to connect to bitcoind on start? If so, it could probably block for some time, sleep, retry, etc., until maybe eventually giving up. If not, then whatever it does on start should be folded into the normal operation loop. What I mean by that: in normal operation electrs presumably runs some loop, listens for notifications, etc., and can tolerate temporary connectivity issues by just retrying. Maybe whatever fails when bitcoind is unreachable on start could become part of that high-level operation loop and retry the same way whenever anything goes wrong.