cargo icon indicating copy to clipboard operation
cargo copied to clipboard

spurious network errors while running CI for Solaris on vmactions/solaris-vm

Open psumbera opened this issue 11 months ago • 10 comments

Problem

We have started to see following network errors:

  warning: spurious network error (3 tries remaining): [28] Timeout was reached (Operation too slow. Less than 10 bytes/sec transferred the last 30 seconds)

as reported here: https://github.com/quinn-rs/quinn/issues/2218

Note that it's running Solaris under QEMU in Linux. I guess that there might have been some changes to QEMU and/or Linux underneeth which enabled these problems. Unfortunatelly main victim seems to be Solaris.

I have come with following workaround. But the cost seems to be longer times...

https://github.com/quinn-rs/quinn/pull/2243

psumbera avatar May 13 '25 08:05 psumbera

Hmm… so 1.86.0 has caused some issues? We did bump curl to from 8.9.0 to 8.12.0 in Rust 1.86: https://github.com/rust-lang/cargo/pull/15162. I wonder if it is possible for you to downgrade to old curl and see what is going on.

This environment could also be set to show more useful debugging info:

CARGO_HTTP_DEBUG=true CARGO_LOG=cargo::ops::registry=debug cargo fetch

weihanglo avatar May 13 '25 16:05 weihanglo

Hmm… so 1.86.0 has caused some issues? We did bump curl to from 8.9.0 to 8.12.0 in Rust 1.86: #15162. I wonder if it is possible for you to downgrade to old curl and see what is going on.

I don't think it's related to cargo version. I have tried version 1.82 and it had the same problems. But originally when Solaris CI was integrated it used this version (1.82) for some time without issue.

psumbera avatar May 13 '25 17:05 psumbera

Workaround seems to be to set CARGO_HTTP_MULTIPLEXING=false.

psumbera avatar May 15 '25 09:05 psumbera

Workaround seems to be to set CARGO_HTTP_MULTIPLEXING=false.

Cargo already has some built-in workaround for disabling multiplexing under some circumstances. Like this and this. However, due to the unusual interaction (Solaris under QEMU), not sure if we'd like to have a similar approach built-in before knowing the root cause.

weihanglo avatar May 15 '25 13:05 weihanglo

This environment could also be set to show more useful debugging info:

CARGO_HTTP_DEBUG=true CARGO_LOG=cargo::ops::registry=debug cargo fetch

I don't see any debug info with this:

CARGO_HTTP_DEBUG=true CARGO_LOG=cargo::ops::registry=debug cargo fetch
  cargo build --locked --all-targets && cargo test --locked --manifest-path fuzz/Cargo.toml && cargo test --locked -p quinn-udp --benches
  /usr/bin/bash /home/runner/work/_actions/vmactions/solaris-vm/v1/run.sh execSSHSH
  Config file: solaris-11.4-gcc.conf
      Updating crates.io index
   Downloading crates ...
    Downloaded windows-targets v0.52.6
    Downloaded untrusted v0.9.0
    Downloaded subtle v2.6.1
    Downloaded write16 v1.0.0
    Downloaded icu_provider_macros v1.5.0
  warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 229231 bytes remaining to read)
  warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 29962 bytes remaining to read)
  warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 47660 bytes remaining to read)
  warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 52722 bytes remaining to read)
  warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 16971 bytes remaining to read)
  warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 7525 bytes remaining to read)
  warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 29673 bytes remaining to read)
  warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 18327 bytes remaining to read)
  warning: spurious network error (3 tries remaining): [28] Timeout was reached (download of `zeroize v1.8.1` failed to transfer more than 1 bytes in 90s)
  warning: spurious network error (3 tries remaining): [28] Timeout was reached (failed to download any data for `rustls v0.23.27` within 90s)

psumbera avatar May 15 '25 14:05 psumbera

Ok I now see it also on my VirtualBox. So I don't think it's QEMU related.

test@virt75:~/quinn-ci-test2$ CARGO_HTTP_DEBUG=true CARGO_LOG=cargo::ops::registry=debug cargo fetch
    Updating crates.io index
  Downloaded wasm-bindgen-macro v0.2.100
  Downloaded subtle v2.6.1
  Downloaded unicode-ident v1.0.18
  Downloaded bumpalo v3.17.0
  Downloaded wasm-bindgen-backend v0.2.100
  Downloaded rustls-webpki v0.103.3
  Downloaded synstructure v0.13.1
  Downloaded zeroize v1.8.1
  Downloaded wasm-bindgen-macro-support v0.2.100
  Downloaded pin-project v1.1.10
warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 8869 bytes remaining to read)
  Downloaded matchers v0.1.0
  Downloaded web-time v1.1.0
  Downloaded js-sys v0.3.77
  Downloaded rustversion v1.0.20
  Downloaded quote v1.0.40
  Downloaded value-bag v1.11.1
  Downloaded signal-hook-registry v1.4.2
  Downloaded rustls-pki-types v1.11.0
  Downloaded windows-targets v0.52.6
  Downloaded rustls v0.23.27
  Downloaded untrusted v0.9.0
  Downloaded wasm-bindgen v0.2.100
  Downloaded thiserror-impl v2.0.12
  Downloaded adler2 v2.0.0
  Downloaded tracing-futures v0.2.5
warning: spurious network error (3 tries remaining): [16] Error in the HTTP2 framing layer
warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 27991 bytes remaining to read)
warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 16310 bytes remaining to read)
warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 58239 bytes remaining to read)
  Downloaded cfg-if v1.0.0
warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 49252 bytes remaining to read)
warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 34446 bytes remaining to read)
warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 10223 bytes remaining to read)
warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 16970 bytes remaining to read)
warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 42691 bytes remaining to read)
warning: spurious network error (3 tries remaining): [18] Transferred a partial file (transfer closed with 22654 bytes remaining to read)
  Downloaded wasm-bindgen-shared v0.2.100
  Downloaded shlex v1.3.0
error: failed to verify the checksum of `shlex v1.3.0`

Maybe it can be related to old CURL 7.79.0 (Old Solaris CBE release 11.4.42). While on baremetal with latest Solaris system (CURL 8.12.0) I don't see it...

psumbera avatar May 15 '25 14:05 psumbera

The aforementioned patch looks like this:

https://github.com/rust-lang/cargo/blob/47c911e9e6f6461f90ce19142031fe16876a3b95/src/cargo/util/context/mod.rs?plain=1#L3117-L3143

We could do stuff once we know the pattern of the failure. (sorry I don't have those machines so can only rely on people to test it 😞)

weihanglo avatar May 15 '25 15:05 weihanglo

Ok, this can be fixed also via:

--- rustc-1.86.0-src/src/tools/cargo/src/cargo/util/context/mod.rs
+++ rustc-1.86.0-src/src/tools/cargo/src/cargo/util/context/mod.rs
@@ -3020,7 +3020,7 @@
     use crate::util::network;

     if network::proxy::http_proxy_exists(http, gctx) && http.multiplexing.is_none() {
-        let bad_curl_versions = ["7.87.0", "7.88.0", "7.88.1"];
+        let bad_curl_versions = ["7.79.0", "7.87.0", "7.88.0", "7.88.1"];
         if bad_curl_versions
             .iter()
             .any(|v| curl_version.starts_with(v))

Note that CURL is 7.79.0 is bundled with older version of Oracle Solaris CBE 11.4.42 which is still in use with Github vmactions/solaris-vm.

psumbera avatar May 16 '25 11:05 psumbera

This environment could also be set to show more useful debugging info:

CARGO_HTTP_DEBUG=true CARGO_LOG=cargo::ops::registry=debug cargo fetch

I don't see any debug info with this:

Sorry should be something lke

CARGO_HTTP_DEBUG=true CARGO_LOG=network=debug cargo fetch

weihanglo avatar May 17 '25 03:05 weihanglo

Sorry should be something lke

CARGO_HTTP_DEBUG=true CARGO_LOG=network=debug cargo fetch

Thank you! Now there is too much data...

fetch.log

psumbera avatar May 19 '25 09:05 psumbera

You probably want to disable libcurl low speed limit in any case with CARGO_HTTP_LOW_SPEED_LIMIT=0. Default is 10 which means if less than 10 bytes are downloaded per timeout (30 seconds), connection is aborted. It is enough to have a 30-second timeout, if nothing is read in 30 seconds then you abort the connection. This libcurl feature is a frequent source of intermittent failures and it's rare that reconnecting makes the speed higher. I don't like that it is configured by default.

You will likely still get timeouts if connection got stuck completely, but at least you will know that you have not received anything at all for 30 seconds.

link2xt avatar Aug 13 '25 01:08 link2xt