jaq outputs JSON with invalid Unicode escapes
I'm not sure if jaq is outputting invalid utf8, or if jq is too liberal in what it accepts. In any case this is easy to reproduce:
Steps:
- Have a json string with unicode codepoints
- Pass it through jaq
Expected:
- jaq outputs the unicode characters in json
Actual:
- jaq escapes the unicode characters as \u{...}, which is not a valid JSON escape
// foo.js
const a = "\u{200b}banana \u{200e}man"
console.log(JSON.stringify({a}))
# this outputs the string with the unicode characters (but they're not visible)
❯ node foo.js
{"a":"banana man"}
# jaq tries to escape the unicode characters, but this is invalid json
❯ node foo.js | jaq
{
"a": "\u{200b}banana \u{200e}man"
}
# jq outputs the unicode characters (but they're not visible)
❯ node foo.js | jq
{
"a": "banana man"
}
# we can see jaq is producing invalid json
❯ node foo.js | jaq | jaq
Error: failed to parse: invalid hexadecimal sequence
# but jq produces valid json
❯ node foo.js | jq | jaq
{
"a": "\u{200b}banana \u{200e}man"
}
❯ node foo.js | jaq | jq
jq: parse error: Invalid characters in \uXXXX escape at line 2, column 35
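For context on why both parsers reject this output: JSON (RFC 8259) only allows a `\u` escape followed by exactly four hexadecimal digits, whereas the braced `\u{...}` form is string-literal syntax in languages such as JavaScript and Rust. A minimal sketch of the difference, using only Node's built-in `JSON.parse` (the helper name `isValidJson` is made up for illustration):

```javascript
// The text jaq printed, kept verbatim as JSON text (String.raw preserves the backslashes).
const bracedEscape = String.raw`{"a": "\u{200b}banana \u{200e}man"}`;
// The same value written with four-digit JSON escapes.
const fourDigitEscape = String.raw`{"a": "\u200bbanana \u200eman"}`;

// Hypothetical helper: true if the text parses as JSON, false otherwise.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidJson(bracedEscape));    // false
console.log(isValidJson(fourDigitEscape)); // true
```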
I can confirm that jaq (including jaq 2.0.0-alpha) produces incorrect output here:
$ cat bananaman.json
"\u200bbanana \u200eman"
$ jq -r . bananaman.json | jq -R . | jaq .
"\u{200b}banana \u{200e}man"
Perhaps the intention was that jaq should write "\u200bbanana \u200eman", which would be valid JSON, although it differs from both the C (jq) and Go (gojq) implementations, which print the characters unescaped:
$ jq -r . bananaman.json | jq -R . > bananaman.quoted.txt
$ jq . bananaman.quoted.txt
"banana man"
$ gojq . bananaman.quoted.txt
"banana man"
Thank you for your bug report, @ibash-corpusant! However, because your PR adds a new dependency, allocates a new string, and may fail, I corrected the problem in a way that avoids all that.
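To make the two valid spellings discussed earlier in the thread concrete: printing the characters unescaped (as jq and gojq do) and printing four-digit `\uXXXX` escapes both decode to the same string. The sketch below is only an illustration in JavaScript, not how jaq implements its output; `toAsciiJson` is a hypothetical helper.

```javascript
// Hypothetical helper: serialize to JSON, then spell every non-ASCII
// UTF-16 code unit as a four-digit \uXXXX escape.
function toAsciiJson(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const original = "\u200bbanana \u200eman";
const raw = JSON.stringify(original);  // invisible characters are written as-is
const escaped = toAsciiJson(original); // same value, spelled with \uXXXX escapes

console.log(escaped);
console.log(JSON.parse(raw) === JSON.parse(escaped)); // true
```

Because the regular expression works on UTF-16 code units, characters outside the Basic Multilingual Plane come out as surrogate-pair escapes, which JSON also accepts.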
Thank you for the quick fix!
I still see this problem on version 2.1.0. Is that possible, or am I doing something wrong? The input is the raw Wiktextract data (extracted from Wiktionary) from https://kaikki.org/dictionary/rawdata.html
$ jaq -c 'select((has("form_of") | not) and has("sounds")) | {(.word): (.sounds | map(select(has("ipa"))))}' ./raw-wiktextract-data.jsonl | rg '\\u\{'
{"विश\u{94d}व":[{"tags":["Delhi"],"ipa":"/ʋɪʃ.ʋə/"},{"tags":["Delhi"],"ipa":"[ʋɪʃ.ʋɐ]"}]}
{"विश\u{94d}व":[{"tags":["Delhi"],"ipa":"/ʋɪʃ.ʋə/"},{"tags":["Delhi"],"ipa":"[ʋɪʃ.ʋɐ]"}]}
{"विश\u{94d}व":[{"ipa":"/ʋiɕ.ʋə/"}]}
{"विश\u{94d}व":[{"ipa":"[bisːo]"}]}
{"विश\u{94d}व":[{"tags":["Vedic"],"ipa":"/ʋíɕ.ʋɐ/"},{"tags":["Classical-Sanskrit"],"ipa":"/ˈʋiɕ.ʋɐ/"}]}
{"विश\u{94d}व":[{"tags":["Vedic"],"ipa":"/ʋíɕ.ʋɐ/"},{"tags":["Classical-Sanskrit"],"ipa":"/ˈʋiɕ.ʋɐ/"}]}
{"क\u{941}त\u{94d}ता":[{"tags":["Delhi"],"ipa":"/kʊt̪.t̪ɑː/"},{"tags":["Delhi"],"ipa":"[kʊt̪̚.t̪äː]"}]}
{"अ\u{902}":[{"ipa":"/əŋ/"},{"ipa":"[aŋ]"}]}
{"अ\u{902}":[{"tags":["Delhi"],"ipa":"/ə̃/"},{"tags":["Delhi"],"ipa":"[ɐ̃]"}]}
{"अ\u{902}":[{"ipa":"/əm/"}]}
{"अ\u{902}":[{"ipa":"[ʌ̃]"},{"ipa":"[ʌm]"}]}
{"अ\u{902}":[{"ipa":"[ə̃ː]"}]}
{"अ\u{901}":[{"ipa":"/ə̃ː/"},{"ipa":"[ãː]"}]}
{"अ\u{901}":[{"tags":["Delhi"],"ipa":"/ə̃/"},{"tags":["Delhi"],"ipa":"[ɐ̃]"}]}
{"अ\u{901}":[{"ipa":"[ʌ̃]"}]}
{"அகரம\u{bcd}":[{"ipa":"/aɡaɾam/"}]}
{"அக\u{bcd}க\u{bbe}":[{"ipa":"/akːaː/"}]}
{"அகம\u{bcd}":[{"ipa":"/aɡam/"}]}
{"அங\u{bcd}கம\u{bcd}":[{"ipa":"/aŋɡam/"}]}
{"יי\u{5b4}דיש":[{"ipa":"/ˈjɪdɪʃ/"}]}
{"יי\u{5b4}דיש":[{"ipa":"/ˈjɪdɪʃ/"}]}
{"m\u{327}uļe":[{"note":"phonetic","ipa":"[mˠulʷe]"},{"note":"phonemic","ipa":"/mˠilʷej/"}]}
{"क\u{94d}या":[{"tags":["Delhi"],"ipa":"/kjɑː/"},{"tags":["Delhi"],"ipa":"[kjäː]"}]}
{"क\u{94d}या":[{"tags":["Delhi"],"ipa":"/kjɑː/"},{"tags":["Delhi"],"ipa":"[kjäː]"}]}
{"פ\u{5bf}ינף":[{"ipa":"/fɪnf/"},{"ipa":"/ˈfɪnəf/"}]}
{"मध\u{94d}य प\u{94d}रद\u{947}श":[{"tags":["Delhi"],"ipa":"/məd̪ʱ.jə .pɾə.d̪eːʃ/"},{"tags":["Delhi"],"ipa":"[mɐd̪ʱ.jɐ‿.pɾɐ.d̪eːʃ]"}]}
{"ฝร\u{e31}\u{e48}ง":[{"tags":["standard"],"ipa":"/fa˨˩.raŋ˨˩/"},{"tags":["standard"],"ipa":"/fa˦˥.raŋ˨˩/"}]}
{"ฝร\u{e31}\u{e48}ง":[{"tags":["standard"],"ipa":"/fa˨˩.raŋ˨˩/"},{"tags":["standard"],"ipa":"/fa˦˥.raŋ˨˩/"}]}
Hi @maribox, thanks for reporting this issue. You're not doing anything wrong; I originally forgot to apply the Unicode formatting code for strings to object keys as well. #259 should correct this. Can you confirm that?
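The kind of bug described here can be sketched as follows: if a serializer formats string values and object keys through separate code paths, a fix applied to one does not reach the other. Routing both through a single string writer avoids that. This is an illustrative JavaScript sketch with made-up names (`writeString`, `writeValue`), not jaq's actual code.

```javascript
// Hypothetical writer: the single place where any JSON string is formatted.
function writeString(s) {
  return JSON.stringify(s);
}

function writeValue(v) {
  if (typeof v === "string") return writeString(v);
  if (Array.isArray(v)) return "[" + v.map(writeValue).join(",") + "]";
  if (v !== null && typeof v === "object") {
    const members = Object.entries(v).map(
      // Keys go through writeString too, so they get the same treatment as values.
      ([key, val]) => writeString(key) + ":" + writeValue(val)
    );
    return "{" + members.join(",") + "}";
  }
  return JSON.stringify(v); // numbers, booleans, null
}

// Key taken from the report above, spelled with escapes so U+094D stays visible in the source.
const key = "\u0935\u093f\u0936\u094d\u0935";
const out = writeValue({ [key]: [{ ipa: "/ʋɪʃ.ʋə/" }] });

console.log(out);
console.log(Object.keys(JSON.parse(out))[0] === key); // true
```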
This should work now as expected.