Fix UTF-8 codepoint split by FormatterWriter

Open ilammy opened this issue 3 years ago • 0 comments

FormatterWriter has to deal with an inherent conflict: fmt::Formatter wants to write &str values (which are required to be valid UTF-8), while io::Write can be used to write arbitrary byte slices. This causes the JSON helper to fail on certain non-ASCII strings.

SIMD-optimized JSON string formatting can issue writes that split UTF-8 codepoints. str::from_utf8() then fails because the input buffer passed to FormatterWriter ends with an incomplete UTF-8 codepoint.

Consider the string from the new test:

"🤨🤨\n😮😮\n🤨🤨\n😮😮OMG"

which is encoded and processed like this:

 F0 9F A4 A8 F0 9F A4 A8 0A F0 9F 98 AE F0 9F 98 AE 0A F0 9F A4 A8 F0 9F A4 A8 0A F0 9F 98 AE F0 9F 98 AE 4F 4D 47   UTF-8 string

|           |           |  |           |           |  |           |           |  |           |           |  |  |  |  UTF-8 codepoint boundaries
                                        -----------                                           -----------
|                                               |                                               |                    i128 boundaries
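
The split shown in the diagram can be reproduced with a small standalone sketch. It cuts the byte string into raw 16-byte (i128-sized) chunks and asks str::from_utf8() how much of each chunk is valid. The helper name valid_prefix_lengths is hypothetical and is not part of the actual code; it only illustrates why naive chunking trips the validation.

```rust
use std::str;

/// Hypothetical helper (not from the real code base): cut `bytes` into
/// fixed-size chunks and return, for each chunk, a pair of
/// (chunk length, number of leading bytes that are valid UTF-8).
fn valid_prefix_lengths(bytes: &[u8], chunk_size: usize) -> Vec<(usize, usize)> {
    bytes
        .chunks(chunk_size)
        .map(|chunk| match str::from_utf8(chunk) {
            // The whole chunk is valid UTF-8.
            Ok(_) => (chunk.len(), chunk.len()),
            // `valid_up_to()` is the length of the longest valid prefix.
            Err(e) => (chunk.len(), e.valid_up_to()),
        })
        .collect()
}
```

For the test string and a chunk size of 16, the first chunk is valid only up to byte 13, because it ends with the first three bytes of a four-byte codepoint. The following raw chunks then begin with stray continuation bytes, so they have no valid prefix at all.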

To deal with this, FormatterWriter can write only the part of the buffer that is valid UTF-8, which keeps fmt::Formatter happy. io::Write allows partial writes, but its users have to be ready for them. Teach write_json_simd() to be ready by keeping track of how many bytes have been written: if write_json_nosimd_prevalidated() doesn't write everything, the remaining suffix gets written on the next call.
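
A minimal sketch of this approach, under stated assumptions: Utf8PrefixWriter and write_in_chunks are hypothetical stand-ins for FormatterWriter and the write_json_simd() loop, not the actual implementation. The writer accepts only the valid UTF-8 prefix of each buffer and reports its length, and the caller advances by the number of bytes accepted, so an unwritten suffix is retried at the start of the next write.

```rust
use std::fmt;
use std::io;
use std::str;

/// Hypothetical stand-in for FormatterWriter: adapts a `fmt::Write` sink
/// (such as a `String`) to `io::Write`.
struct Utf8PrefixWriter<W: fmt::Write> {
    inner: W,
}

impl<W: fmt::Write> io::Write for Utf8PrefixWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let valid = match str::from_utf8(buf) {
            Ok(s) => s,
            // Only a prefix is valid: write that prefix and report a partial write.
            Err(e) if e.valid_up_to() > 0 => {
                str::from_utf8(&buf[..e.valid_up_to()]).expect("prefix was validated")
            }
            // No valid prefix at all: nothing can be passed on as &str.
            Err(e) => return Err(io::Error::new(io::ErrorKind::InvalidData, e)),
        };
        self.inner
            .write_str(valid)
            .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
        Ok(valid.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Hypothetical caller-side loop: offer at most `chunk_size` bytes per write
/// and advance only by the number of bytes the writer actually accepted.
fn write_in_chunks<W: io::Write>(
    out: &mut W,
    bytes: &[u8],
    chunk_size: usize,
) -> io::Result<()> {
    let mut pos = 0;
    while pos < bytes.len() {
        let end = usize::min(pos + chunk_size, bytes.len());
        let written = out.write(&bytes[pos..end])?;
        if written == 0 {
            return Err(io::ErrorKind::WriteZero.into());
        }
        // The unwritten suffix starts the next chunk.
        pos += written;
    }
    Ok(())
}
```

In this sketch, a chunk size of at least 4 bytes guarantees progress on valid UTF-8 input, since every chunk then contains at least one complete codepoint. With the test string and 16-byte chunks, the writes accept 13, 14, and 11 bytes respectively, and the sink ends up holding the original string.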

Benchmarks don't reveal any change in performance caused by that extra byte counting.

Aug 30 '22 02:08 ilammy