`std.decodeUTF8` has different behavior from reference C++ implementation and Go implementation

Open • juliajohannesen opened this issue 11 months ago • 1 comment

In both the reference C++ implementation and the Go implementation of Jsonnet, std.decodeUTF8 decodes lossily: if an invalid byte sequence is encountered in the input, it is converted to � (U+FFFD, REPLACEMENT CHARACTER) and decoding continues with the rest of the input:

$ jsonnet --version
Jsonnet commandline interpreter (Go implementation) v0.20.0
$ jsonnet - <<< "std.decodeUTF8(std.encodeUTF8('foo bar ') + [255] + std.encodeUTF8(' baz'))"
"foo bar � baz"

jrsonnet's standard library instead throws an error when it encounters an invalid byte sequence:

$ jrsonnet --version
jrsonnet 0.5.0-pre96
$ jrsonnet - <<< "std.decodeUTF8(std.encodeUTF8('foo bar ') + [255] + std.encodeUTF8(' baz'))"
runtime error: bad utf8
    <stdin>:1:1-77: function <builtin_decode_utf8> call

A simple approach here would be to use [u8]::utf8_chunks on the underlying &[u8] of the IBytes, pushing U+FFFD when encountering invalid chunks, as shown in the documentation for the iterator returned by this method. I can put a PR for this together sometime tomorrow evening.
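
For illustration, the chunk-based approach could look roughly like the sketch below. It assumes Rust 1.79 or newer (where `<[u8]>::utf8_chunks` is stable) and works on a plain `&[u8]`; the function name `decode_utf8_lossy` is hypothetical, not jrsonnet's actual builtin, and the conversion from `IBytes` to a byte slice is left out.

```rust
/// Hypothetical helper, not jrsonnet's actual builtin: decodes `bytes` as
/// UTF-8, emitting U+FFFD for every invalid sequence instead of failing.
fn decode_utf8_lossy(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    // Each chunk is a (possibly empty) valid prefix followed by a
    // (possibly empty) run of invalid bytes.
    for chunk in bytes.utf8_chunks() {
        out.push_str(chunk.valid());
        if !chunk.invalid().is_empty() {
            // One replacement character per invalid sequence.
            out.push('\u{FFFD}');
        }
    }
    out
}
```

Called on the bytes of 'foo bar ', followed by 255 and the bytes of ' baz', this should produce the same string the Go implementation prints above.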

Mar 27 '25 03:03 juliajohannesen

There is also https://doc.rust-lang.org/stable/std/string/struct.String.html#method.from_utf8_lossy, which is suitable for that. I wonder if the signature of decodeUTF8 should be updated to decodeUTF8(string, lossy = true); I don't like implicit lossy conversions...
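
A rough sketch of what an explicit `lossy` switch might look like on the Rust side, using only `String::from_utf8_lossy` and `std::str::from_utf8` from the standard library. The function name, the plain `&[u8]` input, and the `Utf8Error` return type are assumptions made for this example; the real builtin would take jrsonnet's own types and report its own runtime error.

```rust
use std::str::Utf8Error;

/// Hypothetical signature for illustration only.
/// `lossy = true`  mirrors the C++/Go behavior (invalid input becomes U+FFFD).
/// `lossy = false` keeps the strict behavior and reports an error.
fn decode_utf8(bytes: &[u8], lossy: bool) -> Result<String, Utf8Error> {
    if lossy {
        // from_utf8_lossy returns a Cow<str>; into_owned() yields a String.
        Ok(String::from_utf8_lossy(bytes).into_owned())
    } else {
        // Strict path: fail on the first invalid sequence.
        std::str::from_utf8(bytes).map(str::to_owned)
    }
}
```

Defaulting `lossy` to true would match the other implementations, while passing false would keep the conversion strict for callers who prefer an error.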

Apr 23 '25 21:04 CertainLach