Accepting illegal UTF-8 in strings
utop # let str = "\"Foo\xc0\xafBar\"";;
val str : string = "\"FooÀ¯Bar\""
utop # Yojson.Basic.from_string str;;
- : Yojson.Basic.t = `String "FooÀ¯Bar"
utop # String.is_valid_utf_8 str;;
- : bool = false
The string above is not legal UTF-8 (String.is_valid_utf_8 returns false for it), yet the parser accepts it. Given that JSON is typically read from an external source, it would be best to detect illegal UTF-8 during parsing.
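For context on why these two bytes are rejected: 0xC0 0xAF has the shape of a two-byte sequence, but the payload bits decode to U+002F ("/"), which must be encoded as a single byte. A minimal sketch of that arithmetic follows; the helper names `naive_decode_2` and `is_overlong_2` are made up for illustration and are not part of Yojson or the standard library.

```ocaml
(* Hypothetical helpers, for illustration only. *)

(* Decode a two-byte sequence the naive way: 5 payload bits from the
   lead byte, 6 payload bits from the continuation byte. No validity
   checks are performed. *)
let naive_decode_2 (b0 : char) (b1 : char) : int =
  ((Char.code b0 land 0x1F) lsl 6) lor (Char.code b1 land 0x3F)

(* A two-byte sequence is overlong when the decoded code point would
   have fit in one byte (below U+0080). *)
let is_overlong_2 (b0 : char) (b1 : char) : bool =
  naive_decode_2 b0 b1 < 0x80

let () =
  Printf.printf "U+%04X overlong=%b\n"
    (naive_decode_2 '\xc0' '\xaf')
    (is_overlong_2 '\xc0' '\xaf')
```

A conforming decoder has to reject such sequences, which is why the standard library check reports the string as invalid.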
There is a reference to failing UTF* tests here too: https://github.com/ocaml-community/yojson/issues/34#issue-185524071, which claims that they are a design choice (although I can't find a reference to this anywhere in the documentation, maybe I'm looking in the wrong place).
Even if it is a design choice, it could be offered as an optional feature (validating UTF-8 has a small performance penalty, but there is a good UTF-8 decoder in the OCaml standard library now).
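To make the optional variant concrete, here is a rough sketch of what a validating wrapper could look like from the outside. It assumes OCaml 4.14 or later (for `String.is_valid_utf_8`); `parse_checked` and its `validate` flag are hypothetical names, not an existing Yojson API, and the parser is passed in as an argument so the snippet only depends on the standard library.

```ocaml
(* Hypothetical wrapper, not part of Yojson. Requires OCaml >= 4.14. *)

(* Check the whole input for valid UTF-8 before handing it to [parse].
   With [~validate:false] the input is passed through unchecked, which
   corresponds to the current behaviour. *)
let parse_checked ?(validate = true) (parse : string -> 'a) (input : string)
    : ('a, string) result =
  if validate && not (String.is_valid_utf_8 input) then
    Error "input is not valid UTF-8"
  else Ok (parse input)
```

With Yojson available, something like `parse_checked Yojson.Basic.from_string str` would return an `Error` for the example above. Checking the whole input up front costs an extra pass over the data; validating inside the lexer would avoid that pass but needs changes to the parser itself.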
Hmm, at first I was confused about why UTF-8 is an issue, but reading RFC 8259, section 8.1, I get:
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].
Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.
This is somewhat unfortunate, given that Yojson predates this RFC and OCaml has a tendency to treat strings as byte strings. The lowest currently supported version of OCaml also does not contain the UTF-8 validation code, so I guess adding the validation would be a breaking change for Yojson 4.
I'd rather not make any things more opt-in because I am not a fan of putting the onus on the user to pick the "right" options. I'm currently trying to reduce the amount of confusion and make the codebase easier to read. There are always requests for different parsing results, but all of these options lead to a combinatorial explosion, so if the issue is fairly clear-cut I'd rather just go with the clear-cut behaviour.
I'd rather not make any things more opt-in
If a breaking change is acceptable (e.g. using semantic versioning and bumping the major number), then making Yojson conform to the latest JSON spec by default would be good. I only suggested making it opt-in for potential backward compatibility reasons.
I would suggest using the UTF-8 decoder that is (now) part of the OCaml standard library rather than the one implemented in Yojson. Another option is to expect valid UTF-8 and require the user to ensure this, but that would need to be documented.
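As an illustration of the standard library decoder, the sketch below walks a string with `String.get_utf_8_uchar` and reports the byte offset of the first invalid sequence, which is the kind of information a parser error message would want. It assumes OCaml 4.14 or later; `first_invalid_utf_8` is a made-up helper name, not something Yojson or the standard library provides.

```ocaml
(* Hypothetical helper, for illustration. Requires OCaml >= 4.14. *)

(* Return [Some offset] for the first byte offset at which decoding
   fails, or [None] if the whole string is valid UTF-8. *)
let first_invalid_utf_8 (s : string) : int option =
  let len = String.length s in
  let rec loop i =
    if i >= len then None
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d then
        (* Skip over the bytes of the successfully decoded character. *)
        loop (i + Uchar.utf_decode_length d)
      else Some i
  in
  loop 0

let () =
  match first_invalid_utf_8 "Foo\xc0\xafBar" with
  | Some i -> Printf.printf "invalid UTF-8 at byte offset %d\n" i
  | None -> print_endline "valid UTF-8"
```

If only a yes/no answer is needed, `String.is_valid_utf_8` does the same check in a single call.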