electrs

Bug: Crash on start if can't connect to bitcoind

Open dpc opened this issue 3 months ago • 3 comments

Describe the bug

The core question is: should a daemon like electrs crash on start if it can't connect to bitcoind?

Starting electrs 0.10.1 on x86_64 linux with Config { network: Regtest, db_path: "/build/devimint-7736-767/electrs/regtest", daemon_dir: "/build/devimint-7736-767/bitcoin/regtest", daemon_auth: UserPass("bitcoin", "<sensitive>"), daemon_rpc_addr: 127.0.0.1:17057, daemon_p2p_addr: 127.0.0.1:26493, electrum_rpc_addr: 127.0.0.1:23330, monitoring_addr: 127.0.0.1:24286, wait_duration: 10s, jsonrpc_timeout: 15s, index_batch_size: 10, index_lookup_limit: None, reindex_last_blocks: 0, auto_reindex: true, ignore_mempool: false, sync_once: false, skip_block_download_wait: false, disable_electrum_rpc: false, server_banner: "Welcome to electrs 0.10.1 (Electrum Rust Server)!", signet_magic: fabfb5da, args: [] }
[2024-03-18T06:20:13.921Z INFO  electrs::metrics::metrics_impl] serving Prometheus metrics on 127.0.0.1:24286
[2024-03-18T06:20:13.921Z INFO  electrs::server] serving Electrum RPC on 127.0.0.1:23330
[2024-03-18T06:20:13.942Z INFO  electrs::db] "/build/devimint-7736-767/electrs/regtest": 0 SST files, 0 GB, 0 Grows
[2024-03-18T06:20:13.943Z INFO  electrs::db] closing DB at /build/devimint-7736-767/electrs/regtest
Error: electrs failed

Caused by:
    0: bitcoind RPC polling failed
    1: daemon not available
    2: JSON-RPC error: transport error: Couldn't connect to host: Connection refused (os error 111)

Note the first timestamp: 20:13.921

the whole test suite started:

2024-03-18T06:20:13.911817Z  INFO devimint: Setting up test dir path=/build/devimint-7736-767

timestamp: 20:13.911

bitcoind was spawned in the background earlier, but became available for querying only a few seconds later. Yet about 30ms into the test suite, electrs had already given up on it.

It seems like all the Bitcoin daemons we're using are like that: lightningd, lnd, electrs. Which makes me wonder: is this some shared design decision that I never learned about, or just a weird coincidence? :D All three are written in different languages, by different teams, etc.

Sure, in a real deployment there will always be some kind of supervisor to restart things, but still... I would expect daemons to never shut down just because they can't connect to another networked service. What's the point, if the supervisor is just going to start them again?

The context is: I'm trying to optimize our test suite starting time: letting more things start in parallel, etc. And it would be nice if I could start some daemons around the same time I'm starting bitcoind, and not have to postpone everything until bitcoind takes a shower, brushes teeth, eats breakfast and is finally ready for work.

dpc, Mar 18 '24 07:03

Can I work on this if the issue has not been solved?

448-OG, Apr 04 '24 11:04

Would it solve the issue to check whether a daemon like bitcoind is running every few seconds (with the interval set in the config file), logging an ERROR on each retry and an INFO on each successful connection?

448-OG, Apr 04 '24 11:04

There's some design debate to be had here, I guess. Does electrs really need to connect to bitcoind on start? If so, it could probably block for some time, sleep, retry, etc., and maybe eventually give up. If not, then whatever is done on start should be converted into part of the normal operation loop. What I mean by that: in normal operation electrs probably runs some loop, listens for notifications, etc., and can tolerate temporary connectivity issues by just retrying. Maybe whatever fails when bitcoind is not reachable on start could be made part of such a high-level operation loop and retried all the same if anything goes wrong.
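
To make the "block, sleep, retry, maybe eventually give up" option concrete, here is a rough sketch in plain Rust using only the standard library. It is not electrs code: `RetryPolicy`, `connect_with_retry`, and the log wording are hypothetical names made up for illustration, and the `connect` closure stands in for whatever call currently fails when bitcoind is not reachable.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical retry settings; these are NOT existing electrs config options.
pub struct RetryPolicy {
    /// Pause between two connection attempts.
    pub interval: Duration,
    /// Total time budget; `None` means retry forever.
    pub give_up_after: Option<Duration>,
}

/// Calls `connect` until it succeeds or the time budget is exhausted.
/// On giving up, the error from the last attempt is returned.
pub fn connect_with_retry<T, E, F>(policy: &RetryPolicy, mut connect: F) -> Result<T, E>
where
    E: std::fmt::Display,
    F: FnMut() -> Result<T, E>,
{
    let started = Instant::now();
    let mut attempt: u32 = 0;
    loop {
        attempt += 1;
        match connect() {
            Ok(conn) => {
                eprintln!("INFO: daemon reachable on attempt {attempt}");
                return Ok(conn);
            }
            Err(err) => {
                // Check the budget only after a failed attempt, so at least
                // one attempt is always made.
                let expired = policy
                    .give_up_after
                    .map_or(false, |limit| started.elapsed() >= limit);
                if expired {
                    eprintln!("ERROR: giving up after {attempt} attempt(s): {err}");
                    return Err(err);
                }
                eprintln!("ERROR: daemon not available (attempt {attempt}): {err}; retrying");
                sleep(policy.interval);
            }
        }
    }
}
```

With `give_up_after: None` this behaves like the "part of the normal operation loop" variant, in that it never exits because of a connection failure. With a finite budget it still fails eventually, so a supervisor can take over, just not 30ms after start.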

dpc, Apr 04 '24 19:04