jaq outputs invalid utf8 escaped json

Open ibash-corpusant opened this issue 1 year ago • 5 comments

I'm not sure if jaq is outputting invalid utf8, or if jq is too liberal in what it accepts. In any case this is easy to reproduce:

Steps:

  1. Have a json string with unicode codepoints
  2. Pass it through jaq

Expected:

  1. jaq outputs the unicode characters in json

Actual:

  1. jaq tries to escape the unicode characters

// foo.js
a = "\u{200b}banana \u{200e}man"
console.log(JSON.stringify({a}))

# this outputs the string with the unicode characters (but they're not visible)
❯ node foo.js 
{"a":"​banana ‎man"}

# jaq tries to escape the unicode characters, but this is invalid json
❯ node foo.js | jaq
{
  "a": "\u{200b}banana \u{200e}man"
}

# jq outputs the unicode characters (but they're not visible)
❯ node foo.js | jq
{
  "a": "​banana ‎man"
}

# we can see jaq is producing invalid json
❯ node foo.js | jaq | jaq
Error: failed to parse: invalid hexadecimal sequence

# but jq produces valid json
❯ node foo.js | jq | jaq 
{
  "a": "\u{200b}banana \u{200e}man"
}

❯ node foo.js | jaq | jq 
jq: parse error: Invalid characters in \uXXXX escape at line 2, column 35
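
The failure above can be reproduced without jq or jaq. JSON (RFC 8259) only defines `\u` followed by exactly four hex digits, so the braced form `\u{200b}`, which is valid in JavaScript and Rust string literals, is not a JSON escape. A minimal sketch using Node's built-in `JSON.parse`; the helper name `isValidJson` is hypothetical:

```javascript
// Hypothetical helper: true if `text` parses as JSON, false otherwise.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    return false;
  }
}

// What jaq printed in the report: a braced escape, which JSON does not define.
// String.raw keeps the backslashes literally instead of interpreting them.
const braced = String.raw`{"a": "\u{200b}banana \u{200e}man"}`;

// The JSON escape form: backslash, "u", exactly four hex digits.
const fourDigit = String.raw`{"a": "\u200bbanana \u200eman"}`;

// The raw form that jq prints: the characters themselves, unescaped.
const raw = JSON.stringify({ a: "\u{200b}banana \u{200e}man" });

console.log(isValidJson(braced));    // false
console.log(isValidJson(fourDigit)); // true
console.log(isValidJson(raw));       // true
```

The last two inputs decode to the same string, so either would have been acceptable output; only the braced form is rejected.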

ibash-corpusant avatar Sep 09 '24 23:09 ibash-corpusant

This is to confirm that jaq (including jaq 2.0.0-alpha) is incorrect:

$ cat bananaman.json
"\u200bbanana \u200eman"

$ jq -r . bananaman.json | jq -R . | jaq . 
"\u{200b}banana \u{200e}man"

Perhaps the intention was that jaq should write "\u200bbanana \u200eman", which would be reasonable although at variance with both the C and Go implementations:

$ jq -r . bananaman.json | jq -R . > bananaman.quoted.txt
$ jq . bananaman.quoted.txt
"​banana ‎man"
$ gojq . bananaman.quoted.txt
"​banana ‎man"
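
Both spellings are valid JSON and decode to the same string. As a sketch of the escaped alternative mentioned above, the hypothetical function `quoteAscii` below writes every UTF-16 code unit outside printable ASCII as a four-digit `\uXXXX` escape. Because it works on code units, characters beyond U+FFFF come out as surrogate pairs, which is the form JSON expects. It illustrates the output format only and is not how jq, gojq, or jaq are implemented.

```javascript
// Hypothetical helper: JSON-quote a string using only ASCII in the output.
function quoteAscii(s) {
  let out = '"';
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i); // UTF-16 code unit, 0..0xFFFF
    const ch = s[i];
    if (ch === '"' || ch === "\\") {
      // Quote and backslash need a backslash prefix.
      out += "\\" + ch;
    } else if (unit < 0x20 || unit > 0x7e) {
      // Control and non-ASCII code units become \uXXXX (always four digits).
      out += "\\u" + unit.toString(16).padStart(4, "0");
    } else {
      out += ch;
    }
  }
  return out + '"';
}

const original = "\u{200b}banana \u{200e}man";
const escaped = quoteAscii(original);

console.log(escaped);                          // "\u200bbanana \u200eman"
console.log(JSON.parse(escaped) === original); // true
```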

pkoppstein avatar Sep 10 '24 03:09 pkoppstein

Thank you for your bug report and PR, @ibash-corpusant! However, because the PR adds a new dependency, allocates a new string, and may fail, I fixed the problem in a different way that avoids all of that.

01mf02 avatar Sep 10 '24 10:09 01mf02

thank you for the quick fix!

ibash-corpusant avatar Sep 10 '24 17:09 ibash-corpusant

I still have this problem on version 2.1.0. Is that possible, or am I doing something wrong? The data is the raw Wiktextract data from Wiktionary, available at https://kaikki.org/dictionary/rawdata.html.

$ jaq -c 'select((has("form_of") | not) and has("sounds")) | {(.word): (.sounds | map(select(has("ipa"))))}' ./raw-wiktextract-data.jsonl | rg '\\u\{' 
{"विश\u{94d}व":[{"tags":["Delhi"],"ipa":"/ʋɪʃ.ʋə/"},{"tags":["Delhi"],"ipa":"[ʋɪʃ.ʋɐ]"}]}
{"विश\u{94d}व":[{"tags":["Delhi"],"ipa":"/ʋɪʃ.ʋə/"},{"tags":["Delhi"],"ipa":"[ʋɪʃ.ʋɐ]"}]}
{"विश\u{94d}व":[{"ipa":"/ʋiɕ.ʋə/"}]}
{"विश\u{94d}व":[{"ipa":"[bisːo]"}]}
{"विश\u{94d}व":[{"tags":["Vedic"],"ipa":"/ʋíɕ.ʋɐ/"},{"tags":["Classical-Sanskrit"],"ipa":"/ˈʋiɕ.ʋɐ/"}]}
{"विश\u{94d}व":[{"tags":["Vedic"],"ipa":"/ʋíɕ.ʋɐ/"},{"tags":["Classical-Sanskrit"],"ipa":"/ˈʋiɕ.ʋɐ/"}]}
{"क\u{941}त\u{94d}ता":[{"tags":["Delhi"],"ipa":"/kʊt̪.t̪ɑː/"},{"tags":["Delhi"],"ipa":"[kʊt̪̚.t̪äː]"}]}
{"अ\u{902}":[{"ipa":"/əŋ/"},{"ipa":"[aŋ]"}]}
{"अ\u{902}":[{"tags":["Delhi"],"ipa":"/ə̃/"},{"tags":["Delhi"],"ipa":"[ɐ̃]"}]}
{"अ\u{902}":[{"ipa":"/əm/"}]}
{"अ\u{902}":[{"ipa":"[ʌ̃]"},{"ipa":"[ʌm]"}]}
{"अ\u{902}":[{"ipa":"[ə̃ː]"}]}
{"अ\u{901}":[{"ipa":"/ə̃ː/"},{"ipa":"[ãː]"}]}
{"अ\u{901}":[{"tags":["Delhi"],"ipa":"/ə̃/"},{"tags":["Delhi"],"ipa":"[ɐ̃]"}]}
{"अ\u{901}":[{"ipa":"[ʌ̃]"}]}
{"அகரம\u{bcd}":[{"ipa":"/aɡaɾam/"}]}
{"அக\u{bcd}க\u{bbe}":[{"ipa":"/akːaː/"}]}
{"அகம\u{bcd}":[{"ipa":"/aɡam/"}]}
{"அங\u{bcd}கம\u{bcd}":[{"ipa":"/aŋɡam/"}]}
{"יי\u{5b4}דיש":[{"ipa":"/ˈjɪdɪʃ/"}]}
{"יי\u{5b4}דיש":[{"ipa":"/ˈjɪdɪʃ/"}]}
{"m\u{327}uļe":[{"note":"phonetic","ipa":"[mˠulʷe]"},{"note":"phonemic","ipa":"/mˠilʷej/"}]}
{"क\u{94d}या":[{"tags":["Delhi"],"ipa":"/kjɑː/"},{"tags":["Delhi"],"ipa":"[kjäː]"}]}
{"क\u{94d}या":[{"tags":["Delhi"],"ipa":"/kjɑː/"},{"tags":["Delhi"],"ipa":"[kjäː]"}]}
{"פ\u{5bf}ינף":[{"ipa":"/fɪnf/"},{"ipa":"/ˈfɪnəf/"}]}
{"मध\u{94d}य प\u{94d}रद\u{947}श":[{"tags":["Delhi"],"ipa":"/məd̪ʱ.jə .pɾə.d̪eːʃ/"},{"tags":["Delhi"],"ipa":"[mɐd̪ʱ.jɐ‿.pɾɐ.d̪eːʃ]"}]}
{"ฝร\u{e31}\u{e48}ง":[{"tags":["standard"],"ipa":"/fa˨˩.raŋ˨˩/"},{"tags":["standard"],"ipa":"/fa˦˥.raŋ˨˩/"}]}
{"ฝร\u{e31}\u{e48}ง":[{"tags":["standard"],"ipa":"/fa˨˩.raŋ˨˩/"},{"tags":["standard"],"ipa":"/fa˦˥.raŋ˨˩/"}]}

maribox avatar Feb 08 '25 13:02 maribox

Hi @maribox, thanks for reporting this issue. You're not doing anything wrong; I originally forgot to apply the Unicode formatting code for strings to object keys as well. #259 should correct this. Can you confirm that?
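
The general idea can be sketched with a toy serializer in which object keys and string values share one quoting routine, so a fix to the escaping reaches both. The names `quote` and `serialize` are hypothetical, and since jaq is written in Rust this is an illustration of the principle, not its actual code:

```javascript
// Hypothetical quoting routine: delegate to JSON.stringify, which writes
// non-ASCII characters such as U+094D raw rather than as braced escapes.
function quote(s) {
  return JSON.stringify(s);
}

// Toy compact serializer for strings, arrays, and plain objects.
function serialize(value) {
  if (typeof value === "string") return quote(value);
  if (Array.isArray(value)) return "[" + value.map(serialize).join(",") + "]";
  if (value !== null && typeof value === "object") {
    const members = Object.entries(value).map(
      // Keys go through the same quote() as string values.
      ([key, member]) => quote(key) + ":" + serialize(member)
    );
    return "{" + members.join(",") + "}";
  }
  return JSON.stringify(value); // numbers, booleans, null
}

// A key containing the combining sign U+094D, as in the report above.
const word = "\u0935\u093f\u0936\u094d\u0935";
const entry = { [word]: [{ ipa: "/ʋɪʃ.ʋə/" }] };
const text = serialize(entry);

console.log(text.includes("\\u{"));               // false
console.log(Object.keys(JSON.parse(text))[0] === word); // true
```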

01mf02 avatar Feb 14 '25 18:02 01mf02

This should work now as expected.

01mf02 avatar Sep 25 '25 05:09 01mf02