reqwest-middleware

Marking additional `std::io::ErrorKind` variants as transient (Cloudflare bad TLS packets)

Open beanow-at-crabnebula opened this issue 1 month ago • 0 comments

Motivations

We were investigating flaky requests to Cloudflare. The requests already had a generous retry limit, but the failures were classified as Fatal by the default policy, so they were not retried.

As it turns out, one of the errors looked like:

reqwest::Error {
	kind: Request,
	url: Url { ... },
	source: hyper_util::client::legacy::Error(SendRequest, hyper::Error(Io, Custom { kind: InvalidData, error: "received fatal alert: BadRecordMac" }))
}

There are various reports of this BadRecordMac (rustls) or ERR_SSL_BAD_RECORD_MAC_ALERT (openssl) error when using Cloudflare. Retrying mitigates the issue, but because the error is classified as Fatal rather than Transient, the request fails without a retry.

Solution

Update classify_io_error to mark this error as transient.

fn classify_io_error(error: &std::io::Error) -> Retryable {
    match error.kind() {
-        std::io::ErrorKind::ConnectionReset | std::io::ErrorKind::ConnectionAborted => {
+        std::io::ErrorKind::ConnectionReset | std::io::ErrorKind::ConnectionAborted | std::io::ErrorKind::InvalidData => {
            Retryable::Transient
        }
        _ => Retryable::Fatal,
    }
}
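To illustrate the proposed behaviour outside the crate, here is a minimal, std-only sketch. It assumes the classifier is handed the `std::io::Error` found by walking the `source()` chain of the outer error. `Retryable` is a local stand-in for the crate's enum, and `Wrapped` and `find_io_error` are hypothetical names used only for this example, not part of reqwest-retry.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Local stand-in for `reqwest_retry::Retryable` so the sketch compiles on its own.
#[derive(Debug, PartialEq, Eq)]
enum Retryable {
    Transient,
    Fatal,
}

/// Classification as proposed in this issue: `InvalidData` joins the transient kinds.
fn classify_io_error(error: &io::Error) -> Retryable {
    match error.kind() {
        io::ErrorKind::ConnectionReset
        | io::ErrorKind::ConnectionAborted
        | io::ErrorKind::InvalidData => Retryable::Transient,
        _ => Retryable::Fatal,
    }
}

/// Hypothetical wrapper standing in for the outer `reqwest`/`hyper` error layers.
#[derive(Debug)]
struct Wrapped(io::Error);

impl fmt::Display for Wrapped {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "request failed")
    }
}

impl Error for Wrapped {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.0)
    }
}

/// Hypothetical helper: walk the `source()` chain and return the first `io::Error`.
fn find_io_error<'a>(error: &'a (dyn Error + 'static)) -> Option<&'a io::Error> {
    let mut current = Some(error);
    while let Some(err) = current {
        if let Some(io_err) = err.downcast_ref::<io::Error>() {
            return Some(io_err);
        }
        current = err.source();
    }
    None
}
```

With this sketch, an `InvalidData` error nested inside a wrapper (like the BadRecordMac alert above) is found by `find_io_error` and classified as `Retryable::Transient`, while unrelated kinds such as `PermissionDenied` stay `Retryable::Fatal`.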

Alternatives

Alternatively, mark even more variants as transient. I haven't investigated all of them, but judging by their descriptions, these might be transient:

  • https://doc.rust-lang.org/std/io/enum.ErrorKind.html#variant.BrokenPipe
  • https://doc.rust-lang.org/std/io/enum.ErrorKind.html#variant.TimedOut
  • https://doc.rust-lang.org/std/io/enum.ErrorKind.html#variant.Interrupted
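If the wider set were adopted, the check could look roughly like the sketch below. The function name `is_possibly_transient` is hypothetical, and whether each of these kinds is actually safe to retry has not been verified here; the list simply combines the current variants, `InvalidData`, and the three candidates above.

```rust
use std::io::ErrorKind;

/// Hypothetical predicate covering the current transient kinds, `InvalidData`,
/// and the three candidate kinds listed above. Unverified: some of these may
/// not be safe to retry in every situation.
fn is_possibly_transient(kind: ErrorKind) -> bool {
    matches!(
        kind,
        ErrorKind::ConnectionReset
            | ErrorKind::ConnectionAborted
            | ErrorKind::InvalidData
            | ErrorKind::BrokenPipe
            | ErrorKind::TimedOut
            | ErrorKind::Interrupted
    )
}
```

Keeping the kinds in a single `matches!` makes it easy to add or drop a variant once its behaviour has been investigated.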

Additional context

Tested with

  • reqwest-retry 0.7.0
  • reqwest-middleware 0.4.0
  • reqwest 0.12.4 (including rustls-tls-native-roots)
  • hyper 1.3.1

beanow-at-crabnebula, Jan 08 '25 16:01