rust-protobuf text_format parsing does not correctly handle non-ascii chars?

Open cmyr opened this issue 1 year ago • 1 comments

It's possible I'm misreading this, but I'm running into issues trying to read text-format protobuf files that contain string literals with non-ASCII characters. Reading through the source, the following does not look correct:

https://github.com/stepancheg/rust-protobuf/blob/16c9dc509267a6673f29563f9a01cc3026cc2144/protobuf-support/src/lexer/lexer_impl.rs#L443-L479

Basically, this consumes chars but only ever returns bytes: it converts a char (which represents a Unicode scalar value) directly into a u8, and that byte is then interpreted as UTF-8. Outside of ASCII, though, the integer value of a char does not correspond to its UTF-8 encoding. (For instance, the char À has a Unicode scalar value of 192, but is encoded as 0xC3, 0x80 in UTF-8.)
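To make the mismatch concrete, here is a minimal standalone sketch using only the Rust standard library. The helper names `truncating_cast` and `utf8_bytes` are hypothetical and exist only for illustration; this is not the code from `lexer_impl.rs`.

```rust
/// Hypothetical helper: what a direct `char as u8` cast yields.
/// The cast keeps only the low 8 bits of the Unicode scalar value.
fn truncating_cast(c: char) -> u8 {
    c as u8
}

/// Hypothetical helper: the actual UTF-8 encoding of a char.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // a char needs at most 4 bytes in UTF-8
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// True if the single byte produced by the cast is valid UTF-8 on its own.
fn cast_is_valid_utf8(c: char) -> bool {
    std::str::from_utf8(&[truncating_cast(c)]).is_ok()
}
```

For `'\u{C0}'` (À), `truncating_cast` gives the single byte 0xC0, which is not valid UTF-8 by itself, while `utf8_bytes` gives the two bytes 0xC3, 0x80. For ASCII chars the two agree, which is presumably why the problem only shows up with non-ASCII input.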

I don't think this is hard to fix; you just need to stay in chars the whole time and avoid converting to bytes. Since the text_format input is parsed from &str, which is always valid UTF-8, it should not be possible for a string literal to ever not be valid UTF-8.
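A rough sketch of what "staying in chars" could look like follows. It is not the rust-protobuf lexer and not the eventual patch: `unescape_literal` is a hypothetical function that takes the body of a string literal (quotes already stripped) and handles only a handful of simple escapes. Numeric escapes such as octal or hex, which I believe the text format also allows, are deliberately left out and reported as errors here.

```rust
/// Hypothetical helper, not part of the rust-protobuf API.
/// Decodes a few backslash escapes while pushing whole chars into a String,
/// so non-ASCII input is never squeezed through a u8.
fn unescape_literal(body: &str) -> Result<String, String> {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Ordinary char (ASCII or not): copy it through unchanged.
            out.push(c);
            continue;
        }
        // A backslash was seen; interpret the char that follows it.
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('\\') => out.push('\\'),
            Some('"') => out.push('"'),
            Some('\'') => out.push('\''),
            Some(other) => return Err(format!("unsupported escape: \\{}", other)),
            None => return Err("dangling backslash".to_string()),
        }
    }
    Ok(out)
}
```

Because the output is built with `String::push`, the result is valid UTF-8 by construction, and a literal such as `À\n` (with a literal backslash and `n`) comes back as `À` followed by a newline.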

I'm going to go ahead and write a patch for this and PR it preemptively, since I think it should be relatively trivial; will figure out a test case as well.

Jun 14 '24 20:06 cmyr

Hitting the same problem, and Colin's parse-unicode-strings branch works for me. Please merge!

Sep 03 '24 07:09 simoncozens
