Anti-issue: YAML::PP parses JSON that all the other perl JSON modules can't!

Open warewolf opened this issue 7 years ago • 9 comments

So yeah, this is an anti-issue - I discovered recently that JSON is "a subset of YAML 1.2"; and then discovered YAML::PP. In short: Thank you. YAML::PP doesn't bomb on JSON that is produced with ham-fisted UTF-8 encoding.

It appears that one company in particular that distributes a data feed has somehow "switched on" interpreting all data ingested as UTF-8, even when it wasn't UTF-8 encoded. Imagine interpreting the header of a ZIP file as Unicode. The result is corrupted garbage, and it isn't standards compliant.

Example: {"Subject": "CN=\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\u0531/OU=\ufffd\ufffd\u01b4\ufffd/OU=\u027d\ufffd\ufffd\ufffd\ude64\ufffd\ufffd\u0467/O=sdlg" }

Nothing else in Perl land seems to be able to parse the above JSON document. YAML::PP does, as of v0.005.

My request: Please let this continue to be the case. If you do end up adding validation of Unicode character sequences, give folks an option to turn it off.

warewolf avatar Mar 27 '18 02:03 warewolf

Heh, that's funny. My plan actually is to fix that because, as you said yourself, it's invalid (so the JSON modules saying "missing high surrogate character in surrogate pair" are right). But I see from your example that being able to turn validation off can actually be helpful.

I'll leave this open until I've added validation and a corresponding configuration option.
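To make the "validation plus an off switch" idea concrete, here is a minimal sketch in Python rather than Perl. It is not YAML::PP's API: `load_json`, its `validate` flag, and the sample document are hypothetical names invented for illustration. It relies on the fact that Python's `json.loads` keeps unpaired surrogates in the decoded strings, so the check can run after parsing.

```python
import json


def has_lone_surrogate(value):
    """Return True if any string in a decoded JSON value holds a surrogate.

    json.loads has already merged valid high/low pairs into one character,
    so any code point still in U+D800..U+DFFF is an unpaired surrogate.
    """
    if isinstance(value, str):
        return any(0xD800 <= ord(ch) <= 0xDFFF for ch in value)
    if isinstance(value, list):
        return any(has_lone_surrogate(v) for v in value)
    if isinstance(value, dict):
        return any(
            has_lone_surrogate(k) or has_lone_surrogate(v)
            for k, v in value.items()
        )
    return False


def load_json(text, validate=True):
    """Hypothetical loader: strict by default, lenient when validate=False."""
    doc = json.loads(text)
    if validate and has_lone_surrogate(doc):
        raise ValueError("unpaired surrogate in string value")
    return doc


# Hypothetical sample: one corrupted field next to one useful field.
raw = r'{"Subject": "OU=\ufffd\ude64\ufffd/O=sdlg", "Serial": 42}'

lenient = load_json(raw, validate=False)
print(lenient["Serial"])

try:
    load_json(raw)
except ValueError as err:
    print("strict mode rejected the document:", err)
```

With validation off, the rest of the structure stays usable even though one string value is damaged, which is the behaviour requested above.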

perlpunk avatar Mar 27 '18 13:03 perlpunk

Thank you! And yes, it's been a frustrating experience; the folks who generate the data feed don't seem to think it's their problem to solve. :frowning_face:

warewolf avatar Mar 27 '18 13:03 warewolf

Just a comment: jq seems to handle this input without complaints.

choroba avatar Apr 02 '19 08:04 choroba

@choroba what version of JQ? Last I checked, JQ was still throwing errors on this sort of badly formed JSON.

warewolf avatar Apr 02 '19 13:04 warewolf

jq-1.5. It seems 1.6 should be around, too, so maybe it's different.

choroba avatar Apr 02 '19 13:04 choroba

Okay, so yes; JQ does parse the above example snippet of butchered UTF-8 JSON; but I've got worse examples that JQ barfs on from the same data feed. Either way, YAML-PP is still the only way I can reliably parse this kind of JSON in Perl, and I'm super happy that it still works.

warewolf avatar Apr 02 '19 13:04 warewolf

@warewolf That JSON is invalid because it contains an unpaired half of a surrogate pair. How would you like to handle and decode invalid JSON? Such a string has no representation in UTF-8, so you cannot load and decode it. I see two options: 1) skip every non-parsable byte in the input, or 2) replace non-parsable tokens in the JSON string with the Unicode replacement character. But both options change the input, so when processing it in Perl you would have something different.

I understand the analytical reasons for trying to process as much data as possible, but when the input contains invalid data, you need to specify how to handle it, since the handling is not reversible.
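Option 2 can be sketched as follows, in Python rather than Perl. This is an illustration under stated assumptions, not the behaviour of any existing Perl module: `scrub` and the sample document are hypothetical, and the approach depends on Python's `json.loads` accepting unpaired surrogate escapes and leaving them in the decoded strings.

```python
import json
import re

# Any code point in the surrogate range. json.loads has already combined
# valid high/low pairs into a single character, so a match here means the
# surrogate was unpaired in the input.
LONE_SURROGATE = re.compile("[\ud800-\udfff]")


def scrub(value):
    """Recursively replace unpaired surrogates with U+FFFD.

    This is lossy by design: the original code unit cannot be recovered
    from the result.
    """
    if isinstance(value, str):
        return LONE_SURROGATE.sub("\ufffd", value)
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, dict):
        return {scrub(k): scrub(v) for k, v in value.items()}
    return value


# Hypothetical sample: one corrupted field next to one useful field.
raw = r'{"Subject": "OU=\u027d\ufffd\ude64\ufffd/O=sdlg", "Serial": 42}'

doc = scrub(json.loads(raw))

# After scrubbing, every string can be encoded as UTF-8 again.
encoded = doc["Subject"].encode("utf-8")
print(doc["Serial"], len(encoded))
```

The replacement happens after parsing, so the structure of the document is untouched and only the damaged string values change.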

pali avatar Jan 23 '20 15:01 pali

@pali well, because of how the JSON is already mangled (non-UTF-8 interpreted as UTF-8, which ends up completely fubar, as in "this can't be represented in UTF-8"), I honestly don't expect this to be reversible to something consistent. For my use, the actual string values that are corrupted are irrelevant; the rest of the JSON structure I'm parsing does have value, so for me the important part is not bailing on parsing the entire JSON object.

Sadly I can't fix the origin data because it's from a commercial data feed, and apparently Python will gladly serialize to invalid JSON?
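For what it's worth, that guess is easy to check with Python's standard `json` module. The sketch below uses a hypothetical value, and it says nothing about how the feed itself is actually generated; it only shows that the default serializer writes an unpaired surrogate out as an escape without raising.

```python
import json

# Hypothetical value standing in for one corrupted field from the feed.
broken = "\u027d\ufffd\ude64"

# Default settings (ensure_ascii=True): the unpaired surrogate is written
# out as a \uXXXX escape and no error is raised.
text = json.dumps({"Subject": broken})
print(text)

# The serialized text is pure ASCII, so turning it into bytes succeeds.
payload = text.encode("utf-8")

# Python's own parser reads it back unchanged.
assert json.loads(payload)["Subject"] == broken

# The problem only surfaces when the decoded string must become UTF-8.
try:
    broken.encode("utf-8")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```

So a producer can emit this kind of document without ever seeing an exception, and the failure lands on whichever consumer validates the escapes.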

warewolf avatar Jan 23 '20 16:01 warewolf

I can imagine that the maintainer of Cpanel::JSON::XS could accept an optional feature to also process invalid JSON strings and replace invalid characters with the Unicode replacement character. So if you really have use cases (which it seems you do), open an issue/feature request for Cpanel::JSON::XS.

pali avatar Jan 23 '20 17:01 pali