
How to handle parse failures of persisted objects due to new UTF-8 validation

imirkin opened this issue 6 years ago • 8 comments

"Recently" (i.e. since the last time we performed a vendored code update), strings gained UTF-8 validation of data. So now parsing objects stored in a database or log files no longer works, if they had any offending sequences (which, generically, could be in any string).

The solutions I see:

  1. Use "proto2". This is a non-starter, since it would require a rewrite of all proto definitions (and, I think, code).
  2. Do a global s/string/bytes/ and adjust all the code to wrap each field read in string() and each write in []byte(). This is a lot of replacement, but doable. However, printing objects becomes nasty: we can no longer Printf("%v") them and expect to see a string.
  3. Set validateUTF8 = false in the vendored code instead of = true. Easy enough, but it creates a maintenance burden and moves us away from upstream.
  4. At every Marshal/Unmarshal site, check whether the returned error is of type "interface{ InvalidUTF8() bool }". Seems feasible, especially if we provide our own wrappers for Marshal/Unmarshal and update all the existing call sites to use them (see the sketch after this list).
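
A minimal sketch of what those wrappers could look like, assuming (per the interface assertion above) that validation failures are reported via an error implementing InvalidUTF8() bool, and that the message and encoded bytes are still usable when that is the only error; the package and function names here are made up:

package protoutil

import "github.com/golang/protobuf/proto"

// isInvalidUTF8 reports whether err advertises itself as a UTF-8
// validation failure via the interface assertion described above.
func isInvalidUTF8(err error) bool {
    v, ok := err.(interface{ InvalidUTF8() bool })
    return ok && v.InvalidUTF8()
}

// Unmarshal is like proto.Unmarshal, but treats invalid UTF-8 in
// string fields as non-fatal and keeps whatever was decoded.
func Unmarshal(b []byte, m proto.Message) error {
    if err := proto.Unmarshal(b, m); err != nil && !isInvalidUTF8(err) {
        return err
    }
    return nil
}

// Marshal is like proto.Marshal, but returns the encoded bytes even
// when the only error was invalid UTF-8 in a string field.
func Marshal(m proto.Message) ([]byte, error) {
    b, err := proto.Marshal(m)
    if err != nil && !isInvalidUTF8(err) {
        return nil, err
    }
    return b, nil
}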

I was wondering if the protobuf team (or anyone else) had any other suggestions or opinions on good ways to address this.

imirkin avatar Aug 08 '19 14:08 imirkin

A central problem is that the protobuf standards for a string (which apply to proto3 as well as proto2) say:

A string must always contain UTF-8 encoded or 7-bit ASCII text, and cannot be longer than 2³².

If you're storing raw binary data in a protobuf string type, then that is out of standard. For cases where you have non-UTF-8 values in a byte sequence, you should use bytes. (And likely codepage-translate those values into a UTF-8 Go string.)

Not converting your encoding means that even if you were outputting them as strings, you would get weird replacement-character artifacts with printf("%v"): https://play.golang.org/p/7T0-3Jvr7jE
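
For illustration, a minimal stand-in for the playground snippet (the string here is made up):

package main

import "fmt"

func main() {
    s := "caf\xff" // ends in a byte that is not valid UTF-8
    // %v/%s pass the raw bytes through; most terminals render the
    // invalid byte as the replacement character U+FFFD (�).
    fmt.Printf("%v\n", s)
    // %q escapes it instead, making the bad byte visible: "caf\xff"
    fmt.Printf("%q\n", s)
}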

It might not be “convenient”, but it’s the standards-compliant way to do things.

puellanivis avatar Aug 09 '19 09:08 puellanivis

The data really is strings though, mostly encoded as UTF-8 sequences, with some bad data that somehow snuck in. I have no idea how, or which string fields it's in. Could be anywhere as far as I'm concerned. I don't care about the invalid characters (in fact I'd love to fix them to be valid), so showing U+FFFD is exactly the right thing to do in those cases -- might occasion someone to actually fix the values when practical. That's how the pre-update proto worked anyway.

imirkin avatar Aug 09 '19 14:08 imirkin

Use "proto2". This is a non-starter since it would require a rewrite of all proto definitions (and, I think, code).

Note that proto2 is also specified to require UTF-8. It's just that almost no language implementation enforces it today for proto2, while almost all languages enforce it for proto3. You never know whether one day there will be a mandate to enforce it in all languages for proto2 as well.

Supposing there were a proto field option like go_type that allowed you to specify the generated Go type, would that help you? For example:

Before:

message Foo {
    string my_string = 5;
}

After:

message Foo {
    bytes my_string = 5 [go_type = "string"];
}

This would enable the generated Go code to still use the Go string type, but skip UTF-8 validation, since the protobuf field type is bytes. It also keeps the generated Go API compatible for existing callers. However, this assumes that you're only generating for Go. If you are also using protobufs in other languages, changing the protobuf type from string to bytes might break generated usages in those languages.
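
To make that concrete, the generated Go for such a field might look roughly like this (purely hypothetical output, since no go_type option actually exists):

// Hypothetical generated code for: bytes my_string = 5 [go_type = "string"];
type Foo struct {
    // Encoded as bytes on the wire (so no UTF-8 validation),
    // but exposed to Go callers as an ordinary string.
    MyString string `protobuf:"bytes,5,opt,name=my_string,json=myString"`
}

func (m *Foo) GetMyString() string {
    if m != nil {
        return m.MyString
    }
    return ""
}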

\cc @neild

dsnet avatar Aug 09 '19 16:08 dsnet

My specific use case involves Go and protobuf.js; bytes in JS would become ByteBuffer everywhere. But there could be a similar option for JS.

FWIW, in the short run we've just made the validateUTF8 = false change in the vendored proto source, since we had to get going.

I think a global option for disabling UTF-8 validation would be the best thing. The validation took me by surprise - I'm much more used to thinking about protos as wire encodings, where bytes == string and the rest is just a nice language wrapper on top. [And yeah, you can point at the spec, but if it's not enforced, it's not really there. And switching it on causes all kinds of issues such as these, where you end up with bad data that can no longer be decoded. Imagine performing this replacement at Google company-wide on all protos, including the logging ones... it just wouldn't fly.]

protobuf is stuck between a rock and a hard place. On the one hand, you have reasonable languages, like C and Go, which don't try to solve string encodings in the string data type. On the other hand, you have unreasonable languages like Java, where a String is a sequence of code points (which can definitely be invalid ones). So what do you do when decoding a sequence that doesn't decode?

An answer could be to just let it be -- languages that enforce validation do it, and languages that don't care keep working fine, without taking a perf hit every time a string is used.

imirkin avatar Aug 09 '19 16:08 imirkin

I think a global option for disabling utf-8 validation would be the best thing.

I've advocated for this, but the protobuf team is afraid of prolific use of the option, since it becomes increasingly problematic for language implementations that can't handle invalid UTF-8 in strings; that's a reasonable concern.

protobuf is stuck between a rock and a hard place

Yep. Thanks for understanding that.

dsnet avatar Aug 09 '19 17:08 dsnet

By the way, in my research on this, I also found

optional bool java_string_check_utf8 = 27 [default=false];

in descriptor.proto. No idea what its implications are for the Java protobuf generator... although a very cursory look at

https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/compiler/java/java_string_field.cc#L157

makes it seem like they handle non-UTF-8 sequences just fine. There could be other things influencing that logic that I'm missing, though.

imirkin avatar Aug 09 '19 17:08 imirkin

Nice find. The comment in the file you pointed at says:

It wouldn't roundtrip byte arrays that were not valid UTF-8 encoded strings

This seems to suggest that they sanitize invalid UTF-8 into some valid string, perhaps similar to the Go strings.ToValidUTF8 function. As a result, the data will be silently corrupted.
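
For comparison, here's what that kind of sanitization does, using the strings.ToValidUTF8 function mentioned above (available since Go 1.13); note that the original byte is unrecoverable afterward:

package main

import (
    "fmt"
    "strings"
)

func main() {
    in := "caf\xffe" // one invalid byte inside otherwise-valid UTF-8
    // Each run of invalid bytes is replaced with U+FFFD; the original
    // byte value is lost, i.e. the data is silently "fixed".
    out := strings.ToValidUTF8(in, "\uFFFD")
    fmt.Printf("%q -> %q\n", in, out) // "caf\xffe" -> "caf\ufffde"
}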

As such, I'm not sure I would claim that this option allows Java to handle non-UTF8 "just fine".

dsnet avatar Aug 09 '19 17:08 dsnet

The full comment starts with "This code used to just store a String", and then points out some problems. However, they resolved them with some cleverness.

imirkin avatar Aug 09 '19 17:08 imirkin