surf icon indicating copy to clipboard operation
surf copied to clipboard

"Error: could not decode body as UTF-8" on some websites.

Open Shnatsel opened this issue 4 years ago • 8 comments

On some websites, e.g. http://orgonelab.org, surf fails with the following error:

Error: could not decode body as UTF-8

It is also sometimes reported for other encodings.

Firefox, curl and ureq (a blocking Rust client) work fine.

28348 websites out of the top million from Feb 3 Tranco list are affected.

Tested using this code. Test tool output from all affected websites: surf-could-not-decode-as.tar.gz

I've only tested the async-h1 backend; I don't know if the other backends are affected.

Shnatsel avatar Feb 27 '21 16:02 Shnatsel

RUST_BACKTRACE=1 longboard GET http://orgonelab.org
Error: could not decode body as UTF-8

Fishrock123 avatar Mar 01 '21 18:03 Fishrock123

I'm not convinced that this is a bug. The sample code calls Response::body_string, which checks the encoding. If the body is not the correct encoding, it returns an error. The body of this response is not valid utf8: image

If this is not the desired behavior, the alternative is to use Response::body_bytes() and then perform whatever lossy decoding is appropriate for the use case

jbr avatar Mar 01 '21 22:03 jbr

we could potentially add a Response::body_string_lossy, but I'm not sure if that's a substantial improvement over String::from_utf8_lossy(response.body_bytes().await?) (or utf16, if that's appropriate). alternatively, we could just do a lossy conversion all the time, but i'm not sure if that's appropriate as default behavior

jbr avatar Mar 01 '21 23:03 jbr

This is also reported for a number of other encodings, and other HTTP clients do not report an error here - neither firefox (which uses encoding-rs) nor ureq (a client in Rust backed by encoding-rs) which I similarly forced to convert the body contents to string. I would be surprised to find that they both perform a lossy conversion.

Shnatsel avatar Mar 02 '21 00:03 Shnatsel

What would it mean to losslessly force body contents to string if it's not properly encoded? As per String docs, strings are always utf8-encoded. If the returned byte content is not valid utf8, the only way to return a string is with a lossy conversion.

ureq does a lossy conversion when the encoding feature is disabled, and ignores the return variable that indicates if the conversion failed when encoding is enabled. Surf checks for encoding failure and intentionally returns an error representing that failure.

Silently ignoring encoding errors is not necessarily a feature everyone wants. I'm not sure what the right choice is here, but it's not clear to me that copying the behavior of other clients is the right choice, and the code is certainly behaving as intended, whether or not that's the behavior we want going forward

jbr avatar Mar 02 '21 00:03 jbr

This behavior is documented at https://docs.rs/surf/2.1.0/surf/struct.Response.html#method.body_string

jbr avatar Mar 02 '21 00:03 jbr

Personally I'd expect an HTTP client to mirror the behavior of web browsers by default, but I agree the desired behavior in this case is not clear-cut.

Shnatsel avatar Mar 02 '21 00:03 Shnatsel

I think the existing escape hatch (read to byte buffer, do manual lossy to string) probably covers our bases here.

The docs on body_string() could probably be clearer, though.

Fishrock123 avatar Mar 02 '21 17:03 Fishrock123