Bug: Response::body_json could not convert from a UTF-8 BOM source file
Target
The body_json function of the Response struct
Detail
The body_json function cannot parse a response whose body comes from a UTF-8 BOM source file.
Repro
- Prepare a UTF-8 BOM JSON file; the BOM bytes are 0xef 0xbb 0xbf.
- Deploy the JSON file and run an httpd.
- Call surf::get and then body_json => SerdeJsonError(Error("expected value", line: 1, column: 1))
Workaround
// This will be an error if the response body has a BOM.
// let my_struct = my_response.body_json::<MyStruct>();
- Raw implementation:
let my_response_body_string = smol::block_on(my_response.body_string()).unwrap();
let my_response_body_string = if my_response_body_string.starts_with("\u{feff}") { &my_response_body_string[3..] } else { &my_response_body_string[..] };
let my_struct: MyStruct = serde_json::from_str(my_response_body_string).unwrap();
- Or use the strip_bom crate:
use strip_bom::StripBom;
let my_response_body_string = smol::block_on(my_response.body_string()).unwrap();
let my_response_body_string = my_response_body_string.strip_bom();
let my_struct: MyStruct = serde_json::from_str(my_response_body_string).unwrap();
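For reuse, both workarounds can be folded into one small helper. This is only a sketch of mine, assuming the v2.0.0-alpha async API used in the repro code below; the name body_json_ignoring_bom is hypothetical and not part of surf:
use serde::de::DeserializeOwned;

// Hypothetical helper (not part of surf): read the body as a string,
// drop any leading UTF-8 BOM, then deserialize with serde_json.
async fn body_json_ignoring_bom<T: DeserializeOwned>(res: &mut surf::Response) -> Result<T, serde_json::Error> {
    let body = res.body_string().await.expect("failed to read response body");
    let body = body.trim_start_matches('\u{feff}');
    serde_json::from_str(body)
}

// Example call, mirroring the repro code below:
// let sushi: Sushi = smol::block_on(body_json_ignoring_bom(&mut r)).unwrap();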
Notes:
- This error is a SerdeJsonError, but the character-encoding problem is probably not an issue in the serde_json or serde crates: serde_json is designed to take only a UTF-8 stream (serde_json::from_reader for an IO stream, serde_json::from_str for a string), not a file directly and not a BOM. So I think handling the BOM is the serde_json user's responsibility, i.e. surf's. If the authors think this is not a surf issue but a serde_json or serde issue, I'll file it on the serde_json or serde repos instead. (See also the small stand-alone check after these notes.)
- It is probably not a std::string::String issue either; see https://github.com/rust-lang/rfcs/issues/2428.
- JSON (ECMA-404, RFC 8259) defines UTF-8 as the only character encoding, and the BOM is left undefined, so a JSON file with a UTF-8 BOM is not invalid per the specification. Of course, until today I thought nobody would put a UTF-8 BOM in a JSON file, but a Japanese government institution did. 🥺
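To separate surf from serde_json here, this small stand-alone check (my addition, not from the repro) shows that serde_json itself rejects a BOM-prefixed string and accepts the same JSON once the BOM is stripped:
fn main() {
    // The same JSON as the test files below, with and without a BOM.
    let with_bom = "\u{feff}{ \"sushi\": \"🍣\" }";

    // Fails with "expected value at line 1 column 1", like the surf error above.
    let err = serde_json::from_str::<serde_json::Value>(with_bom).unwrap_err();
    println!("{}", err);

    // Succeeds once the BOM is removed.
    let without_bom = with_bom.trim_start_matches('\u{feff}');
    let value: serde_json::Value = serde_json::from_str(without_bom).unwrap();
    println!("{}", value);
}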
Is this still true in the 2.0 alphas? 2.0.0-alpha.0 and later include support for non-UTF-8 encodings, but that code should also handle UTF-8 with a BOM.
@goto-bus-stop Unfortunately yes, this problem occurs with all of these versions:
surf = "1.0.3"
surf = {git = "https://github.com/http-rs/surf", tag = "v2.0.0-alpha.0"}
surf = {git = "https://github.com/http-rs/surf", tag = "v2.0.0-alpha.1"}
surf = {git = "https://github.com/http-rs/surf", tag = "v2.0.0-alpha.2"}
surf = {git = "https://github.com/http-rs/surf", tag = "v2.0.0-alpha.3"}
surf = {git = "https://github.com/http-rs/surf", tag = "v2.0.0-alpha.4"}
surf = {git = "https://github.com/http-rs/surf", tag = "v2.0.0-alpha.5"}
Example repro code
main.rs
use serde::{Deserialize, Serialize};
#[derive(Debug, Serialize, Deserialize)]
struct Sushi {
sushi: String,
}
fn main() {
// UTF-8N; `{ "sushi": "🍣" }`
let mut r = smol::block_on(surf::get("http://localhost:50000/n-sushi.txt")).unwrap();
let sushi = smol::block_on(r.body_json::<Sushi>()).unwrap();
println!("n{:?}", &sushi);
// UTF-8 BOM; `\u{feff}{ "sushi": "🍣" }`
let mut r = smol::block_on(surf::get("http://localhost:50000/bom-sushi.txt")).unwrap();
let sushi = smol::block_on(r.body_json::<Sushi>()).unwrap();
println!("b{:?}", &sushi);
}
[dependencies] section of Cargo.toml:
[dependencies]
serde = {version = "1.0.115", features = ["derive"]}
serde_json = "1.0.57"
smol = "1.0.0"
surf = {git = "https://github.com/http-rs/surf", tag = "v2.0.0-alpha.5"}
# surf = "1.0.3"
Example UTF-8N and UTF-8 BOM files
hexdump --canonical n-sushi.txt  # UTF-8N, i.e. UTF-8 without BOM (reference)
00000000 7b 20 22 73 75 73 68 69 22 3a 20 22 f0 9f 8d a3 |{ "sushi": "....|
00000010 22 20 7d |" }|
00000013
hexdump --canonical bom-sushi.txt  # UTF-8 with BOM (target of this test)
00000000 ef bb bf 7b 20 22 73 75 73 68 69 22 3a 20 22 f0 |...{ "sushi": ".|
00000010 9f 8d a3 22 20 7d |..." }|
00000016
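The two files can also be regenerated with a short sketch like this (my addition; the file names match the hexdumps above):
use std::fs;

fn main() -> std::io::Result<()> {
    let json = "{ \"sushi\": \"🍣\" }";
    // UTF-8N (no BOM), 0x13 bytes.
    fs::write("n-sushi.txt", json)?;
    // UTF-8 with BOM (0xef 0xbb 0xbf prefix), 0x16 bytes.
    fs::write("bom-sushi.txt", format!("\u{feff}{}", json))?;
    Ok(())
}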
Repro procedure
simple-http-server -p 50000
cargo run
And then I got the result:
// cargo process
Finished dev [unoptimized + debuginfo] target(s) in 0.41s
Running `target\debug\se.exe`
// The result part of the first UTF-8N reference
nSushi { sushi: "🍣" }
// The result part of the second UTF-8 BOM testing
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: expected value at line 1 column 1', src\main.rs:15:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: process didn't exit successfully: `target\debug\se.exe` (exit code: 101)