bdk icon indicating copy to clipboard operation
bdk copied to clipboard

CI: `blockchain::esplora::bdk_blockchain_tests::test_sync_stop_gap_20` occationally fails

Open evanlinjin opened this issue 3 years ago • 16 comments

Describe the bug

CI: blockchain::esplora::bdk_blockchain_tests::test_sync_stop_gap_20 occationally fails.

The command that fails

$ cargo test --no-default-features --features test-esplora,use-esplora-ureq,verify esplora::bdk_blockchain_tests

Occurences

  • https://github.com/bitcoindevkit/bdk/runs/7163681328?check_suite_focus=true

evanlinjin avatar Jul 02 '22 19:07 evanlinjin

I have been witnessing few occasional failures in blockchain tests in my local too.. Seems like some kind of resource blocking thingy happening somewhere?? I was seeing "resource unavailable" errors few days ago when running blockchian tests all together.. Which I can't seem to reproduce anymore.. Could be something related??

From the error message it seems something got timed out while waiting for some stuffs..

rajarshimaitra avatar Jul 03 '22 10:07 rajarshimaitra

I have been witnessing few occasional failures in blockchain tests in my local too.. Seems like some kind of resource blocking thingy happening somewhere?? I was seeing "resource unavailable" errors few days ago when running blockchian tests all together.. Which I can't seem to reproduce anymore.. Could be something related??

From the error message it seems something got timed out while waiting for some stuffs..

Same here. Would it make sense to implement some retry logic/exponential backoff?

We can keep track of the errors and where they occur here.

evanlinjin avatar Jul 03 '22 16:07 evanlinjin

Same here. Would it make sense to implement some retry logic/exponential backoff?

We do have an exponential pole at electrum calls https://github.com/bitcoindevkit/bdk/blob/063d51fd7587815c733f24365a4de02fed072903/src/testutils/blockchain_tests.rs#L345-L359

To me it seems for some reason the electrum calls isn't returning and some resources are not being free??

The error seems to disappear in single test runs.. Only when I run all the blockchian tests together..

rajarshimaitra avatar Jul 04 '22 09:07 rajarshimaitra

This looks like a bug in electrsd:

[2022-07-02T18:30:55Z ERROR electrsd] early exit with: ExitStatus(unix_wait_status(134))

Maybe it doesn't handle too much load well. I'll see if I can somehow attach a debugger and reproduce.

afilini avatar Jul 04 '22 09:07 afilini

This is a one-off failure, not sure if it is related.

failures:

---- blockchain::esplora::bdk_blockchain_tests::test_double_spend stdout ----
thread 'blockchain::esplora::bdk_blockchain_tests::test_double_spend' panicked at 'called `Result::unwrap()` on an `Err` value: JSON(Error("expected value", line: 1, column: 1))', src/testutils/blockchain_tests.rs:65:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    blockchain::esplora::bdk_blockchain_tests::test_double_spend

evanlinjin avatar Jul 04 '22 13:07 evanlinjin

Maybe it's related to this? https://github.com/bitcoindevkit/rust-electrum-client/issues/67

afilini avatar Jul 04 '22 14:07 afilini

failures:

---- blockchain::esplora::bdk_blockchain_tests::test_sync_address_index_should_not_decrement stdout ----
thread 'blockchain::esplora::bdk_blockchain_tests::test_sync_address_index_should_not_decrement' panicked at 'called `Result::unwrap()` on an `Err` value: EarlyExit(ExitStatus(unix_wait_status(134)))', src/testutils/blockchain_tests.rs:36:75
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    blockchain::esplora::bdk_blockchain_tests::test_sync_address_index_should_not_decrement

test result: FAILED. 256 passed; 1 failed; 5 ignored; 0 measured; 0 filtered out; finished in 6.41s

evanlinjin avatar Jul 04 '22 16:07 evanlinjin

 failures:
error: test failed, to rerun pass '--lib'
---- blockchain::esplora::bdk_blockchain_tests::test_tx_chain stdout ----
thread 'blockchain::esplora::bdk_blockchain_tests::test_tx_chain' panicked at 'called `Result::unwrap()` on an `Err` value: EarlyExit(ExitStatus(unix_wait_status(134)))', src/testutils/blockchain_tests.rs:36:75
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
    blockchain::esplora::bdk_blockchain_tests::test_tx_chain

evanlinjin avatar Jul 05 '22 02:07 evanlinjin

---- blockchain::electrum::test::bdk_blockchain_tests::test_sync_bump_fee_basic stdout ----
thread 'blockchain::electrum::test::bdk_blockchain_tests::test_sync_bump_fee_basic' panicked at 'called `Result::unwrap()` on an `Err` value: IOError(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })', src/testutils/blockchain_tests.rs:65:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

evanlinjin avatar Jul 12 '22 23:07 evanlinjin

This one is the craziest yet: https://gist.github.com/evanlinjin/3079251115e64edc355f46d0d63edb1e

evanlinjin avatar Jul 13 '22 09:07 evanlinjin

Yes.. I have been seeing this one in my local..

rajarshimaitra avatar Jul 13 '22 14:07 rajarshimaitra

Yes.. I have been seeing this one in my local..

I mean if the error states "Resource TEMPORARILY unavailable", I think we can safely wrap the calls with exponential back-off logic. I did see a function in the tests for exponential back-off, so maybe we can reuse that.

evanlinjin avatar Jul 13 '22 23:07 evanlinjin

My guess: the tcp send buffer gets filled up because core can't process all those requests at once. Increasing the number of threads would probably help.

Source https://linux.die.net/man/2/send:

When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error EAGAIN or EWOULDBLOCK in this case

afilini avatar Jul 14 '22 10:07 afilini

I ran into this issue testing bitcoindevkit/bdk-cli#102 with the new regtest-bitcoin feature which starts up a local bitcoind and connects to it via the RPC client, but the error only seemed to happen on my MacOS system, the CI tests all pass on Ubuntu Linux. While debugging I noticed in the bitcoind logs a list of RPC requests to "Update existing descriptor: raw(...)" seems to be the default 100 address look ahead, so I think the requests are all getting successfully delivered but the RPC client connection disconnects always after 5 seconds and throws the SocketError "Resource temporarily unavailable". I'll try making the bitcoind server responded faster (as @afilini suggested with more threads), and separately see if I can make the RPC client timeout longer and report back.

notmandatory avatar Sep 10 '22 18:09 notmandatory

I re-tried my test with more rpcthreads set on bitcoind with conf set to via "-rpcthread=", I tired 20 and 50, didn't seem to make a difference. I haven't find a way yet to change the RPC call socket timeout.

notmandatory avatar Sep 10 '22 20:09 notmandatory

Probably best to disregard my above RPC problems since the title of this issue is really about esplora. I'll open a new issue for my RPC macos problem.

notmandatory avatar Sep 10 '22 20:09 notmandatory

Could this issue be related to the default number for file descriptors running out? This is mentioned for core here: JSON-RPC-interface limitations.

notmandatory avatar Mar 02 '23 17:03 notmandatory

As per https://docs.oracle.com/cd/E19476-01/821-0505/file-descriptor-requirements.html, the default limit for file descriptors per process is 1024. I am not fully sure, but I think we are not hitting that many RPC requests.

Another pointer from LDK team was, it should just redo the call again if it hits the error. https://github.com/lightningdevkit/rust-lightning/pull/2033#issuecomment-1432018010

I will try with this on the new RPC example and report back..

rajarshimaitra avatar Mar 26 '23 04:03 rajarshimaitra