"Error: could not decode body as UTF-8" on some websites.
On some websites, e.g. http://orgonelab.org, surf fails with the following error:
Error: could not decode body as UTF-8
It is also sometimes reported for other encodings.
Firefox, curl and ureq (a blocking Rust client) work fine.
28348 websites out of the top million from Feb 3 Tranco list are affected.
Tested using this code. Test tool output from all affected websites: surf-could-not-decode-as.tar.gz
I've only tested the async-h1 backend; I don't know if the other backends are affected.
RUST_BACKTRACE=1 longboard GET http://orgonelab.org
Error: could not decode body as UTF-8
I'm not convinced that this is a bug. The sample code calls Response::body_string, which checks the encoding. If the body is not the correct encoding, it returns an error. The body of this response is not valid utf8:

If this is not the desired behavior, the alternative is to use Response::body_bytes() and then perform whatever lossy decoding is appropriate for the use case
we could potentially add a Response::body_string_lossy, but I'm not sure if that's a substantial improvement over String::from_utf8_lossy(response.body_bytes().await?) (or utf16, if that's appropriate). alternatively, we could just do a lossy conversion all the time, but i'm not sure if that's appropriate as default behavior
This is also reported for a number of other encodings, and other HTTP clients do not report an error here - neither firefox (which uses encoding-rs) nor ureq (a client in Rust backed by encoding-rs) which I similarly forced to convert the body contents to string. I would be surprised to find that they both perform a lossy conversion.
What would it mean to losslessly force body contents to string if it's not properly encoded? As per String docs, strings are always utf8-encoded. If the returned byte content is not valid utf8, the only way to return a string is with a lossy conversion.
ureq does a lossy conversion when the encoding feature is disabled, and ignores the return variable that indicates if the conversion failed when encoding is enabled. Surf checks for encoding failure and intentionally returns an error representing that failure.
Silently ignoring encoding errors is not necessarily a feature everyone wants. I'm not sure what the right choice is here, but it's not clear to me that copying the behavior of other clients is the right choice, and the code is certainly behaving as intended, whether or not that's the behavior we want going forward
This behavior is documented at https://docs.rs/surf/2.1.0/surf/struct.Response.html#method.body_string
Personally I'd expect an HTTP client to mirror the behavior of web browsers by default, but I agree the desired behavior in this case is not clear-cut.
I think the existing escape hatch (read to byte buffer, do manual lossy to string) probably covers our bases here.
The docs on body_string() could probably be clearer, though.