YAML-PP-p5
Anti-issue: YAML::PP parses JSON that all the other perl JSON modules can't!
So yeah, this is an anti-issue - I discovered recently that JSON is "a subset of YAML 1.2"; and then discovered YAML::PP. In short: Thank you. YAML::PP doesn't bomb on JSON that is produced with ham-fisted UTF-8 encoding.
It appears that one company in particular that distributes a data feed has somehow "switched on" interpreting all data ingested as UTF-8, even when it wasn't UTF-8 encoded. Imagine interpreting the header of a ZIP file as Unicode. The result is corrupted garbage, and it isn't standards compliant.
Example:
{"Subject": "CN=\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\u0531/OU=\ufffd\ufffd\u01b4\ufffd/OU=\u027d\ufffd\ufffd\ufffd\ude64\ufffd\ufffd\u0467/O=sdlg" }
Nothing else in Perl land seems to be able to parse the above JSON document. YAML::PP does, as of v0.005.
My request: Please let this continue to be the case. If you do end up adding validation of Unicode character sequences, give folks an option to turn it off.
Heh, that's funny. My plan actually is to fix that because, as you said yourself, it's invalid (so the JSON modules saying "missing high surrogate character in surrogate pair" are right). But I see from your example that allowing validation to be turned off can actually be helpful.
I'll leave this open until I have added validation and a corresponding configuration option.
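For anyone wondering which escape triggers "missing high surrogate character in surrogate pair": in the example it is `\ude64`, a low surrogate (U+DC00 to U+DFFF) with no high surrogate (U+D800 to U+DBFF) directly before it. Here is a rough Python sketch that scans raw JSON text for such unpaired escapes. `find_unpaired_surrogates` is a made-up helper name for illustration, not part of YAML::PP or any JSON module, and it only looks at `\uXXXX` escapes, not at raw bytes.

```python
import re

# Matches any backslash escape; group 1 holds the hex digits for \uXXXX escapes.
# Consuming "\\" as one token keeps an escaped backslash followed by "uXXXX"
# from being mistaken for a Unicode escape.
ESCAPE = re.compile(r'\\(?:u([0-9a-fA-F]{4})|.)')


def find_unpaired_surrogates(json_text):
    """Return (offset, escape) for every surrogate escape that has no partner."""
    escapes = [m for m in ESCAPE.finditer(json_text) if m.group(1)]
    problems = []
    i = 0
    while i < len(escapes):
        m = escapes[i]
        code = int(m.group(1), 16)
        if 0xD800 <= code <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            nxt = escapes[i + 1] if i + 1 < len(escapes) else None
            if (nxt is not None and nxt.start() == m.end()
                    and 0xDC00 <= int(nxt.group(1), 16) <= 0xDFFF):
                i += 2  # valid pair, skip both halves
                continue
            problems.append((m.start(), m.group(0)))
        elif 0xDC00 <= code <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            problems.append((m.start(), m.group(0)))
        i += 1
    return problems


# Part of the Subject value from the report, kept as raw text.
sample = r'{"Subject": "OU=\u027d\ufffd\ufffd\ufffd\ude64\ufffd\ufffd\u0467/O=sdlg"}'
for offset, escape in find_unpaired_surrogates(sample):
    print(offset, escape)
```

Running something like this over a feed before parsing at least tells you which records a strict parser will reject.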
Thank you! And yes, it's been a frustrating experience; the folks who generate the data feed don't seem to think it's their problem to solve. :frowning_face:
Just a comment: jq seems to handle this input without complaints.
@choroba what version of jq? Last I checked, jq was still throwing errors on this sort of badly formed JSON.
jq-1.5. It seems 1.6 should be around, too, so maybe it's different.
Okay, so yes; jq does parse the above example snippet of butchered UTF-8 JSON, but I've got worse examples from the same data feed that jq barfs on. Either way, YAML-PP is still the only way I can reliably parse this kind of JSON in Perl, and I'm super happy that it still works.
@warewolf That JSON is invalid because it contains an unpaired half of a surrogate pair. How would you like invalid JSON to be handled and decoded? Such a string has no representation in UTF-8, so you cannot load and decode it as is. I see two options: 1) skip every non-parsable byte in the input, or 2) replace non-parsable tokens in the JSON string with the Unicode replacement character. But both options change the input, so when processing it in Perl you would have something different from what was sent.
I understand the analytical reasons for trying to process as much data as possible, but when the input is invalid, it is necessary to specify how it should be handled, given that the handling is not reversible.
@pali well, because of how the JSON is already mangled (non-UTF-8 data interpreted as UTF-8, which ends up completely fubar: "this can't be represented in UTF-8"), I honestly don't expect this to be reversible to something consistent. For my use, the actual string values that are corrupted are irrelevant; the rest of the JSON structure I'm parsing does have value, so for me the important part is not bailing on parsing the entire JSON object.
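To make the two options concrete, here is a small Python sketch rather than Perl, since Python's standard `json` module happens to accept unpaired surrogate escapes on load. It parses leniently and then either drops each lone surrogate or swaps it for U+FFFD. `scrub` is a hypothetical helper name, and this is only an illustration of the idea, not how YAML::PP or Cpanel::JSON::XS behave.

```python
import json
import re

# After json.loads, valid surrogate pairs have already been combined into a
# single code point, so any surrogate still present in a string is unpaired.
LONE_SURROGATE = re.compile(r'[\ud800-\udfff]')


def scrub(value, replacement='\ufffd'):
    """Recursively replace unpaired surrogates in every string of a parsed document.

    Pass replacement='' to skip them (option 1) or keep the default
    U+FFFD replacement character (option 2).
    """
    if isinstance(value, str):
        return LONE_SURROGATE.sub(replacement, value)
    if isinstance(value, list):
        return [scrub(item, replacement) for item in value]
    if isinstance(value, dict):
        return {scrub(k, replacement): scrub(v, replacement)
                for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged


raw = r'{"Subject": "O=\u027d\ude64\u0467", "Serial": 42}'
data = json.loads(raw)            # lenient: keeps the lone \ude64 in the string
replaced = scrub(data)            # option 2: U+FFFD in place of the surrogate
skipped = scrub(data, '')         # option 1: surrogate removed
replaced["Subject"].encode('utf-8')  # now encodable; the unscrubbed value is not
```

Either way the damaged string differs from the input, but the rest of the structure (here the `Serial` field) comes through untouched.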
Sadly I can't fix the original data because it's from a commercial data feed, and apparently Python will gladly serialize to invalid JSON?
I can imagine that the maintainer of Cpanel::JSON::XS could accept an optional feature to also process invalid JSON strings and replace invalid characters with the Unicode replacement character. So if you really have use cases (which it seems you do), open an issue/feature request for Cpanel::JSON::XS.
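On the Python side, this is easy to reproduce with the standard `json` module: a `str` containing a lone surrogate is written out as a bare `\udXXX` escape without complaint. Whether the feed is actually produced this way is a guess. The sketch below shows the behaviour plus one possible guard; `dumps_strict` is a hypothetical name, not an existing API.

```python
import json

# A Python str may hold an unpaired surrogate; json.dumps escapes it as-is.
lenient = json.dumps({"Subject": "\ude64"})
print(lenient)


def dumps_strict(obj):
    """Serialize to JSON, but refuse strings that contain unpaired surrogates."""
    text = json.dumps(obj, ensure_ascii=False)
    # Encoding to UTF-8 raises UnicodeEncodeError if a lone surrogate is present.
    text.encode('utf-8')
    return text
```

A producer that used a check along these lines would fail loudly at write time instead of shipping documents that strict parsers reject.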