Intermittent wasi-http test failures
Seeing sporadic integration_tests::test_wasi_http_double_echo test failures; recently in https://github.com/fermyon/spin/actions/runs/6724485363/job/18276859959?pr=2022 (unrelated PR).
I say sporadic because it has hit another PR and a rerun of the job was successful. Race condition?
Error output:
Error receiving body: error reading a body from connection: unexpected EOF during chunk size line
Caused by:
unexpected EOF during chunk size line
2023-11-01T20:24:07.377657Z TRACE spin_trigger_http::handler: wasi-http memory consumed: 1441792
2023-11-01T20:24:07.378647Z WARN spin_trigger_http: hyper::Error(BodyWrite, Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })
thread 'integration_tests::test_wasi_http_double_echo' panicked at 'body content mismatch (expected length 1048576; actual length 557056)', tests/integration.rs:838:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test integration_tests::test_wasi_http_double_echo ... FAILED
@dicej are you best placed to look at this?
I've been running the test over and over in a bash while loop for some time now -- no failures yet. Maybe it's some kind of interaction with the other tests running in parallel. I'll keep trying to repro.
Running all the integration tests in parallel did eventually produce a failure, but this time it was a different error in a different test:
failures:
---- integration_tests::test_wasi_http_hash_all stdout ----
Error: error sending request for url (http://127.0.0.1:63836/hash-all): connection error: Connection reset by peer (os error 54)
Caused by:
0: connection error: Connection reset by peer (os error 54)
1: Connection reset by peer (os error 54)
failures:
integration_tests::test_wasi_http_hash_all
I think https://github.com/fermyon/spin/pull/2019 will help us diagnose these better.
As far as I can tell, we haven't been seeing this failure recently. Closing for now; please reopen if spotted.
Sure enough, closing the issue prompted another seemingly related flake in https://github.com/fermyon/spin/actions/runs/6803772480/job/18499823072
STDERR:
Warning: You're using a pre-release version of Spin (2.1.0-pre0). This plugin might not be compatible (supported: >=0.5). Continuing anyway.
Error: error sending request for url (http://127.0.0.1:35985/test/hello): error trying to connect: tcp connect error: Connection refused (os error 111)
Caused by:
0: error trying to connect: tcp connect error: Connection refused (os error 111)
1: tcp connect error: Connection refused (os error 111)
2: Connection refused (os error 111)
test integration_tests::test_simple_rust_local ... FAILED
Re-opening and updating the wording of this issue to reflect that it doesn't appear to be caused by any one test...
We might need to punt and retry any tests that use the network up to N times if there's no obvious culprit here. It might also help to force them to run serially, e.g. by holding a mutex for the duration of each test.
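For illustration, a minimal sketch of the "retry up to N times" idea. The `retry` helper, its parameters, and the fixed delay are all hypothetical and not taken from Spin's test suite; a real version would probably want to retry only on connection-level errors so that genuine assertion failures still fail fast.

```rust
use std::fmt::Debug;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: runs `op` up to `max_attempts` times, sleeping for
/// `delay` between attempts. Returns the first success, or the error from
/// the final attempt if every attempt fails.
fn retry<T, E: Debug>(
    max_attempts: usize,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Attempts remain: log the failure so flakes stay visible, then retry.
            Err(err) if attempt < max_attempts => {
                eprintln!("attempt {attempt}/{max_attempts} failed: {err:?}; retrying");
                sleep(delay);
                attempt += 1;
            }
            // Out of attempts: surface the last error to the caller.
            Err(err) => return Err(err),
        }
    }
}
```

A network-using test could then wrap only the request itself, e.g. `retry(3, Duration::from_millis(200), || send_request())`, where `send_request` stands in for whatever the test already does. Logging each failed attempt keeps the underlying flakiness observable instead of silently hiding it.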
One other thought: I know that both Linux and macOS have kernel-level heuristics for detecting and mitigating TCP SYN flood attacks, which I've had to explicitly disable when running load tests in the past since they would otherwise cause spurious network errors. It could be that we're triggering those heuristics here; even if the total number of connections we're making is relatively small, the fact that we're opening and closing connections very quickly and in parallel may be enough to trigger them. In that case, serializing the integration tests may be a viable workaround.
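A rough sketch of what serializing with a mutex could look like, assuming all the affected tests live in the same test binary (the lock is per-process, so it does nothing across separate binaries). The names `NETWORK_TEST_LOCK`, `serialize_network_test`, and `with_network_lock` are made up for this example.

```rust
use std::sync::{Mutex, MutexGuard};

// Hypothetical process-wide lock shared by every test that touches the network.
static NETWORK_TEST_LOCK: Mutex<()> = Mutex::new(());

/// Blocks until no other network test is running, then returns a guard.
/// The lock is released when the guard is dropped, i.e. at the end of the test.
fn serialize_network_test() -> MutexGuard<'static, ()> {
    // A panicking test poisons the mutex. The lock protects no data, so it is
    // safe to recover the guard rather than fail every later test as well.
    NETWORK_TEST_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Convenience wrapper: runs `body` while holding the lock.
fn with_network_lock<T>(body: impl FnOnce() -> T) -> T {
    let _guard = serialize_network_test();
    body()
}
```

Each test would start with `let _guard = serialize_network_test();` (or wrap its body in `with_network_lock`). Running `cargo test -- --test-threads=1` has a similar effect without code changes, at the cost of also serializing tests that never touch the network.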