electrs Bug: Crash on start if it can't connect to bitcoind

Describe the bug

The core question is: should a daemon like electrs crash on start if it can't connect to bitcoind?

Starting electrs 0.10.1 on x86_64 linux with Config { network: Regtest, db_path: "/build/devimint-7736-767/electrs/regtest", daemon_dir: "/build/devimint-7736-767/bitcoin/regtest", daemon_auth: UserPass("bitcoin", "<sensitive>"), daemon_rpc_addr: 127.0.0.1:17057, daemon_p2p_addr: 127.0.0.1:26493, electrum_rpc_addr: 127.0.0.1:23330, monitoring_addr: 127.0.0.1:24286, wait_duration: 10s, jsonrpc_timeout: 15s, index_batch_size: 10, index_lookup_limit: None, reindex_last_blocks: 0, auto_reindex: true, ignore_mempool: false, sync_once: false, skip_block_download_wait: false, disable_electrum_rpc: false, server_banner: "Welcome to electrs 0.10.1 (Electrum Rust Server)!", signet_magic: fabfb5da, args: [] }
[2024-03-18T06:20:13.921Z INFO  electrs::metrics::metrics_impl] serving Prometheus metrics on 127.0.0.1:24286
[2024-03-18T06:20:13.921Z INFO  electrs::server] serving Electrum RPC on 127.0.0.1:23330
[2024-03-18T06:20:13.942Z INFO  electrs::db] "/build/devimint-7736-767/electrs/regtest": 0 SST files, 0 GB, 0 Grows
[2024-03-18T06:20:13.943Z INFO  electrs::db] closing DB at /build/devimint-7736-767/electrs/regtest
Error: electrs failed

Caused by:
    0: bitcoind RPC polling failed
    1: daemon not available
    2: JSON-RPC error: transport error: Couldn't connect to host: Connection refused (os error 111)

Note the first timestamp: 20:13.921

The whole test suite started here:

2024-03-18T06:20:13.911817Z  INFO devimint: Setting up test dir path=/build/devimint-7736-767

timestamp: 20:13.911

bitcoind was spawned in the background earlier, but it only became available for querying a few seconds later. Yet 30ms into the test suite, electrs had already given up on it.

It seems like all the Bitcoin daemons we're using behave like that: lightningd, lnd, electrs. Which makes me wonder: is this some shared design decision that I never learned about, or just a weird coincidence? :D All three are written in different languages, by different teams, etc.

Sure, in a real deployment there will always be some kind of supervisor to restart things, but still... I would expect daemons to never shut down just because they can't connect to another networked service. What's the point, if the supervisor is just going to start them again?

The context is: I'm trying to optimize our test suite starting time: letting more things start in parallel, etc. And it would be nice if I could start some daemons around the same time I'm starting bitcoind, and not have to postpone everything until bitcoind takes a shower, brushes teeth, eats breakfast and is finally ready for work.

Mar 18 '24 07:03 dpc

Can I work on this if the issue has not been solved?

Apr 04 '24 11:04 448-OG

Would it solve the issue to check whether a daemon like bitcoind is running every few seconds (at an interval set in the config file), logging an ERROR on each failed retry and an INFO on each successful connection?

Apr 04 '24 11:04 448-OG

There's some design debate to be had here, I guess. Does electrs really need to connect to bitcoind on start? If so, it could probably block for some time, sleep, retry, etc., and maybe eventually give up. If not, then whatever it does on start should be converted into part of the normal operation loop. What I mean by that: in normal operation electrs probably runs some loop, listens for notifications, etc., and can tolerate temporary connectivity issues by just retrying. Maybe whatever fails when bitcoind is not reachable on start could become part of such a high-level operation loop and retry in the same way if anything goes wrong.
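To make the "block, sleep, retry, eventually give up" option concrete, here is a minimal sketch using only the Rust standard library. It is not how electrs is implemented; `retry_with_delay`, `wait_for_port`, and the attempt count and delay are hypothetical names and values chosen for illustration.

```rust
use std::net::TcpStream;
use std::thread::sleep;
use std::time::Duration;

/// Calls `op` until it succeeds or `max_attempts` attempts have been made,
/// sleeping `delay` between attempts. `op` is always called at least once.
/// On give-up, the error from the last attempt is returned.
fn retry_with_delay<T, E: std::fmt::Display>(
    max_attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(err) => {
                eprintln!("attempt {attempt}/{max_attempts} failed: {err}; retrying");
                attempt += 1;
                sleep(delay);
            }
        }
    }
}

/// Hypothetical helper: waits for a TCP port (for example bitcoind's RPC
/// port) to accept connections. The limits here are placeholders.
fn wait_for_port(addr: &str) -> std::io::Result<TcpStream> {
    retry_with_delay(30, Duration::from_secs(1), || TcpStream::connect(addr))
}
```

The same helper could wrap any fallible startup step, so the startup path and the steady-state loop would share one retry policy instead of behaving differently.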

Apr 04 '24 19:04 dpc

There's a very good reason to do that: to detect misconfiguration early. If these daemons ignored failures then it'd take extra steps to verify that the configuration is correct.

However your request is a very valid one and it has a neat solution: use systemd socket activation. The trick is to bind sockets first and then start all services in parallel. Once a service needs to call into another service it just blocks until the other service is ready. You can also postpone start of a service until it's actually needed by something (but that's probably not your problem).

However to support socket activation you need the service to be able to reuse an already-bound socket. IOW it requires support from bitcoind. Adding it directly would be nice but meanwhile you can try using this: https://github.com/ryancdotorg/libsdsock
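The idea behind socket activation can be illustrated in a single process, without systemd. In the sketch below (an illustration only; `socket_activation_demo` is a made-up name), the socket is bound first, a slow "service" thread reuses the already-bound listener, and the client connects immediately and simply blocks until the service answers.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Binds the socket first, starts a slow "service" that reuses it, and
/// connects a client right away. Returns what the client read.
fn socket_activation_demo(startup_delay: Duration) -> std::io::Result<String> {
    // Step 1: the "supervisor" binds the socket before any service runs.
    // Port 0 asks the OS for any free port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Step 2: the service starts late and reuses the already-bound socket.
    let service = thread::spawn(move || -> std::io::Result<()> {
        thread::sleep(startup_delay); // simulated slow startup
        let (mut conn, _) = listener.accept()?;
        conn.write_all(b"ready")?;
        Ok(()) // `conn` is dropped here, which closes the connection
    });

    // Step 3: the client connects immediately. The kernel queues the
    // connection in the listen backlog, so it is not refused.
    let mut client = TcpStream::connect(addr)?;
    let mut reply = String::new();
    client.read_to_string(&mut reply)?; // blocks until the service replies and closes
    service.join().expect("service thread panicked")?;
    Ok(reply)
}
```

With real socket activation the bound socket is handed to a separate process as an inherited file descriptor rather than moved into a thread, which is why the service itself has to support it.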

Note also that bitcoind has a startupnotify option which allows you to start electrs right after it launches. It's not as good as a parallel start, but it at least minimizes the time difference.

Aug 05 '24 10:08 Kixunil

> There's a very good reason to do that: to detect misconfiguration early.

Yeah, the motives are good, but in practice, for a long running service it just means the external service behavior is inconsistent. "detect misconfig. early" matters maybe when the service is first being set up. But this inconsistent behavior is present every time the app starts afterwards.

> However your request is a very valid one and it has a neat solution: use systemd socket activation.

Ignoring that it is OS-dependent, adds some extra complexity, etc., it just doesn't work if, e.g., bitcoind is remote. One could be using wireguard or some other tunneling and simply have connectivity issues: the service works somewhat OK until it happens to be restarted, and then it starts failing. Weird.

The inconsistency of behavior when starting and already running makes the "detect misconfiguration" motivation invalid.

Aug 05 '24 15:08 dpc

> But this inconsistent behavior is present every time the app starts afterwards.

To solve this we'd have to remember what the last configuration was and then compare the two (or their hashes). It smells bad, but I'm not sure why.
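A rough sketch of that idea: fingerprint the configuration, and only fail fast when the fingerprint differs from the one recorded at the last successful start. Everything here (`StartupPolicy`, `config_fingerprint`, `startup_policy`, and where the previous fingerprint would be stored) is hypothetical and not part of electrs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
enum StartupPolicy {
    /// New or changed configuration: report connection errors immediately.
    FailFast,
    /// Same configuration as the last successful start: keep retrying.
    KeepRetrying,
}

/// Hashes the raw configuration text. Note: `DefaultHasher`'s algorithm is
/// not guaranteed to stay the same across Rust releases, so a real
/// implementation that persists the value would want a stable hash.
fn config_fingerprint(config_text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    config_text.hash(&mut hasher);
    hasher.finish()
}

/// `last_good` is the fingerprint saved after the previous successful
/// start, or `None` if there has never been one.
fn startup_policy(last_good: Option<u64>, config_text: &str) -> StartupPolicy {
    if last_good == Some(config_fingerprint(config_text)) {
        StartupPolicy::KeepRetrying
    } else {
        StartupPolicy::FailFast
    }
}
```

This only covers changes to electrs's own configuration; it cannot notice a change made on the bitcoind side.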

> it just doesn't work e.g. if bitcoind is remote

I think it's safe to assume tests are running locally using regtest and that's where speed matters. Once you deploy it the speed is not really that important because you'll launch it once per several months at most.

> The inconsistency of behavior when starting and already running makes the "detect misconfiguration" motivation invalid.

How? The configuration cannot be changed at runtime, so one has to restart anyway.

Aug 05 '24 15:08 Kixunil

> How? The configuration cannot be changed run time so one has to restart anyway.

If bitcoind is unavailable while electrs is already running, electrs will just keep retrying. If it's unavailable when electrs is starting, electrs will fail.

bitcoind's configuration can change without electrs restarting. For this reason alone, it's not electrs's job to test for misconfiguration. These matters need to be handled e2e, and continuously.

Trying to detect misconfiguration may be well-meaning, but it is misplaced and misguided: it introduces a weird behavioral inconsistency while attempting to achieve something that can't be done right at this level anyway.

Aug 05 '24 15:08 dpc

IME changing ports and the like, which is what would have a real impact here, is very rare. It basically never happens. Getting the initial configuration to work is the "hard" part. I believe the current behavior provides a great balance of costs and benefits, even if it looks inconsistent.

Also, if your goal is to start the tests as soon as possible, then we should have some way to force electrs to retry the connection before a timer expires, probably using a signal. But it'd be best to check whether systemd supports some mechanism for this and make it compatible.
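One way to picture "retry early on a signal, otherwise when the timer expires" is a wait on a channel with a timeout. The sketch below uses only the standard library; `Wake` and `wait_before_retry` are made-up names, and wiring an actual Unix signal to the sending side of the channel is assumed to happen elsewhere (it needs an external crate or platform-specific code).

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Wake {
    /// Something (e.g. a signal handler) asked for an immediate retry.
    Forced,
    /// Nobody asked; the normal retry interval elapsed.
    TimerExpired,
    /// All senders are gone, so no early wake-up can ever arrive.
    SenderGone,
}

/// Waits up to `wait` before the next connection attempt, returning early
/// if a wake-up message arrives on `wake_rx`.
fn wait_before_retry(wake_rx: &Receiver<()>, wait: Duration) -> Wake {
    match wake_rx.recv_timeout(wait) {
        Ok(()) => Wake::Forced,
        Err(RecvTimeoutError::Timeout) => Wake::TimerExpired,
        Err(RecvTimeoutError::Disconnected) => Wake::SenderGone,
    }
}
```

A test harness could then send the wake-up as soon as bitcoind reports it is ready, instead of waiting out the full retry interval.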

Aug 06 '24 09:08 Kixunil

hitting this while doing some tests:

bitcoind running as a systemd service
electrs running as a systemd service

if i restart the bitcoind service, electrs will stop on:

Error: electrs failed

Caused by:
    0: sync failed
    1: sending on a disconnected channel

it's not a big deal in this case as the electrs service will restart, but it looks a bit ugly. i'd have expected at least a timeout in this case, not an immediate stop

Sep 03 '24 05:09 pythcoiner

@pythcoiner that doesn't match what @dpc said - that it does retry. Which of you is correct? I do think it should retry if it's already running.

Sep 03 '24 05:09 Kixunil

@Kixunil in my case it does not retry at all, it stops the instant i run systemctl restart bitcoind. Maybe it's a different issue; i can open another issue if so

Sep 03 '24 07:09 pythcoiner
