Intermittent wasi-http test failures

Open vdice opened this issue 2 years ago • 8 comments

Seeing sporadic integration_tests::test_wasi_http_double_echo test failures; recently in https://github.com/fermyon/spin/actions/runs/6724485363/job/18276859959?pr=2022 (unrelated PR).

I say sporadic because it has hit another PR and a rerun of the job was successful. Race condition?

Error output:

Error receiving body: error reading a body from connection: unexpected EOF during chunk size line

Caused by:
    unexpected EOF during chunk size line
2023-11-01T20:24:07.377657Z TRACE spin_trigger_http::handler: wasi-http memory consumed: 1441792
2023-11-01T20:24:07.378647Z  WARN spin_trigger_http: hyper::Error(BodyWrite, Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })
thread 'integration_tests::test_wasi_http_double_echo' panicked at 'body content mismatch (expected length 1048576; actual length 557056)', tests/integration.rs:838:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test integration_tests::test_wasi_http_double_echo ... FAILED

vdice avatar Nov 01 '23 20:11 vdice

@dicej are you best placed to look at this?

itowlson avatar Nov 01 '23 20:11 itowlson

I've been running the test over and over in a bash while loop for some time now -- no failures yet. Maybe it's some kind of interaction with the other tests running in parallel. I'll keep trying to repro.

dicej avatar Nov 01 '23 22:11 dicej

Running all the integration tests in parallel did eventually produce a failure, but this time it was a different error in a different test:

failures:

---- integration_tests::test_wasi_http_hash_all stdout ----
Error: error sending request for url (http://127.0.0.1:63836/hash-all): connection error: Connection reset by peer (os error 54)

Caused by:
    0: connection error: Connection reset by peer (os error 54)
    1: Connection reset by peer (os error 54)


failures:
    integration_tests::test_wasi_http_hash_all

dicej avatar Nov 01 '23 22:11 dicej

I think https://github.com/fermyon/spin/pull/2019 will help us diagnose these better.

dicej avatar Nov 02 '23 14:11 dicej

As far as I can tell, we haven't been seeing this failure recently. Closing for now; please reopen if spotted.

vdice avatar Nov 08 '23 20:11 vdice

Sure enough, closing the issue prompted another seemingly related flake in https://github.com/fermyon/spin/actions/runs/6803772480/job/18499823072

STDERR:
Warning: You're using a pre-release version of Spin (2.1.0-pre0). This plugin might not be compatible (supported: >=0.5). Continuing anyway.

Error: error sending request for url (http://127.0.0.1:35985/test/hello): error trying to connect: tcp connect error: Connection refused (os error 111)

Caused by:
    0: error trying to connect: tcp connect error: Connection refused (os error 111)
    1: tcp connect error: Connection refused (os error 111)
    2: Connection refused (os error 111)
test integration_tests::test_simple_rust_local ... FAILED

Re-opening and updating the wording of this issue to reflect that it doesn't appear to be caused by any one test...

vdice avatar Nov 08 '23 22:11 vdice

We might need to punt and retry any tests that use the network up to N times if there's no obvious culprit here. It might also help to force them to run serially, e.g. by holding a mutex for the duration of each test.
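For illustration, the retry idea could look roughly like the sketch below. The `retry` helper, its parameters, and the fixed delay between attempts are assumptions made for this example; nothing like it is confirmed to exist in the Spin test suite.

```rust
use std::fmt::Debug;
use std::thread;
use std::time::Duration;

/// Hypothetical helper (not part of Spin): run `attempt` up to `max_attempts`
/// times, returning the first success or the error from the final attempt.
/// The closure receives the 1-based attempt number.
fn retry<T, E: Debug>(
    max_attempts: u32,
    delay: Duration,
    mut attempt: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "need at least one attempt");
    let mut n = 1;
    loop {
        match attempt(n) {
            Ok(value) => return Ok(value),
            // Attempts remain: log the failure, wait, and try again.
            Err(err) if n < max_attempts => {
                eprintln!("attempt {n}/{max_attempts} failed: {err:?}; retrying");
                thread::sleep(delay);
                n += 1;
            }
            // Out of attempts: surface the last error to the caller.
            Err(err) => return Err(err),
        }
    }
}
```

A network-using test would wrap its request in something like `retry(3, Duration::from_millis(200), |_| send_request())`, where `send_request` stands in for whatever the test actually does. The obvious downside is that retries can hide a real regression, so logging each failed attempt (as above) seems worth keeping.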

dicej avatar Nov 09 '23 00:11 dicej

One other thought: I know that both Linux and macOS have kernel-level heuristics for detecting and mitigating TCP SYN flood attacks, which I've had to explicitly disable when running load tests in the past since they would otherwise cause spurious network errors. It could be that we're triggering those heuristics here; even if the total number of connections we're making is relatively small, the fact that we're opening and closing connections very quickly and in parallel may be enough to trigger them. In that case, serializing the integration tests may be a viable workaround.
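A minimal sketch of the serialization workaround, assuming all the affected tests live in the same test binary (so they run as threads in one process and can share a static). The names `NETWORK_TEST_LOCK` and `serialize_network_test` are made up for this example.

```rust
use std::sync::{Mutex, MutexGuard};

// Hypothetical process-wide lock shared by every test that touches the network.
static NETWORK_TEST_LOCK: Mutex<()> = Mutex::new(());

/// Acquire the lock; the caller holds the returned guard for the whole test.
/// A test that panics while holding the guard poisons the mutex, so recover
/// the guard from the poison error instead of failing every later test.
fn serialize_network_test() -> MutexGuard<'static, ()> {
    NETWORK_TEST_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

// In a real suite this function would carry `#[test]`.
fn example_network_test() {
    // Held until the end of the function, so no other guarded test overlaps.
    let _guard = serialize_network_test();
    // ... start the server, send requests, assert on the responses ...
}
```

Only tests that take the guard are serialized; everything else still runs in parallel. If that turns out not to be enough, `cargo test -- --test-threads=1` serializes the whole binary at the cost of a slower run.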

dicej avatar Nov 10 '23 15:11 dicej