[feature] support Unicode Transformation Format (UTF)
Yes, fq does support Unicode in some way, but Unicode is tricky. I'd like to transform between an array of bytes and an array of Unicode code points via UTF-8, UTF-16 (UTF-16LE and UTF-16BE), or UTF-32 (UTF-32LE and UTF-32BE), optionally with a byte order mark (BOM). Malformed UTF sequences should be reported as errors and/or replaced with the Unicode replacement character U+FFFD. On top of that, it should be possible to transform arrays of code points to arrays of characters via Unicode normalization. Only the latter is visible to a human reader.
An example:
0x| 65 cc 81 | bytes (UTF-8)
U+| 0065 0301 | code points
| e ◌́ | characters
| é | normalized characters (NFC)
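To make the intended pipeline concrete, here is a minimal sketch in Python rather than fq (it only uses the standard `unicodedata` module and says nothing about how fq would implement this). It decodes the bytes from the example, lists the code points, and applies NFC:

```python
import unicodedata

data = bytes([0x65, 0xCC, 0x81])            # UTF-8 bytes from the example
text = data.decode("utf-8")                 # strict decoding: bytes -> string
code_points = [ord(ch) for ch in text]      # string -> code points

nfc = unicodedata.normalize("NFC", text)    # compose "e" + combining acute accent
nfc_code_points = [ord(ch) for ch in nfc]

print(code_points)      # [101, 769]
print(nfc_code_points)  # [233]
```

The two printed lists correspond to the "code points" and "normalized characters (NFC)" rows above.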
Hey! Do you imagine this would be done via a bunch of jq functions, maybe with arguments for options etc.? Maybe some made-up usage examples would be useful to see what it would look like?
fq already has some UTF and text conversion functions, see https://github.com/wader/fq/blob/master/format/text/encoding.jq. They can be used to go from binary (raw bits or bytes) to UTF-8 (what fq/gojq uses for strings) and then to code points. Example usage:
# convert jq string to some encoding, return a binary (raw bytes)
$ fq -cn '"ö" | to_utf16be'
│00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d│0123456789abcd│
0x0│00 f6│ │..│ │.: raw bits 0x0-0x2 (2)
# same, but when stdout is not a tty you get the raw output
$ fq -cn '"ö" | to_utf16be' | hexdump -C
00000000 00 f6 |..|
00000002
# use an array of numbers as binary (from_* will automatically convert to binary) and decode as utf16 be
$ fq -cn '[0x00, 0xf6] | from_utf16be'
"ö"
# to codepoints
$ fq -cn '[0x00, 0xf6] | from_utf16be | explode'
[246]
# back to string
$ fq -cn '[0x00, 0xf6] | from_utf16be | explode | implode'
"ö"
# to utf16 be; if be/le is omitted a BOM is also added
$ fq -cn '[0x00, 0xf6] | from_utf16be | explode | implode | to_utf16be'
│00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d│0123456789abcd│
0x0│00 f6│ │..│ │.: raw bits 0x0-0x2 (2)
Would normalization and other Unicode operations be functions that work on code point arrays or strings? Both?
Thanks for the examples. Looks like 90% of this issue is only a lack of documentation. I haven't found how to raise errors on malformed UTF, e.g. an option to let fq -cn '[0xc0] | from_utf8' raise an error because 0xc0 is invalid UTF-8.
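As a rough illustration of the two behaviours I have in mind (error vs. replacement), this is how it looks in Python. The helper name `decode_utf8` and its `replace` flag are made up for this sketch and are not an fq API:

```python
def decode_utf8(data: bytes, replace: bool = False) -> str:
    """Decode UTF-8, either failing on malformed input or substituting U+FFFD."""
    return data.decode("utf-8", errors="replace" if replace else "strict")

malformed = bytes([0xC0])  # 0xc0 can never start a valid UTF-8 sequence

try:
    decode_utf8(malformed)
    strict_failed = False
except UnicodeDecodeError as err:
    strict_failed = True
    print("error:", err.reason)

replaced = decode_utf8(malformed, replace=True)
print([f"U+{ord(ch):04X}" for ch in replaced])  # ['U+FFFD']
```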
One minor helpful addition might be an option to display Unicode code points in U+ notation (e.g. U+1F600) instead of as their numeric values.
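The U+ notation is just uppercase hex padded to at least four digits. A small Python sketch of the formatting I mean (the function name `format_code_point` is hypothetical):

```python
def format_code_point(cp: int) -> str:
    """Format a code point as U+XXXX (at least four hex digits, uppercase)."""
    return f"U+{cp:04X}"

print(format_code_point(128512))                 # U+1F600
print([format_code_point(cp) for cp in [101, 769]])  # ['U+0065', 'U+0301']
```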
Given a Unicode string encoded in UTF it would be useful to inspect the binary data. Given this:
fq -cn '"e\u0301"|to_utf16'
|00 01 02 03 04 05 06 07 08 09 0a 0b|0123456789ab|
0x0|ff fe 65 00 01 03| |..e...| |.: raw bits 0x0-0x6 (6)
How to get to something like this?
|00 01 02 03 04 05 06 07 08 09 0a 0b|0123456789ab|
0x0|ff fe | ....| | BOM (LE)
0x0| 65 00 |..e ..| | .[0] U+0065
0x0| 01 03 |....◌́ | | .[1] U+0301
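To spell out what each row of that dump would contain, here is a Python sketch that splits a UTF-16 byte string into a BOM row and one row per code point. The function `annotate_utf16` is made up for illustration, and defaulting to big-endian when no BOM is present is an assumption of the sketch, not something fq does:

```python
def annotate_utf16(data: bytes):
    """Return (offset, raw_bytes, label) rows for a UTF-16 byte string."""
    if data[:2] == b"\xff\xfe":
        codec, bom = "utf-16-le", "BOM (LE)"
    elif data[:2] == b"\xfe\xff":
        codec, bom = "utf-16-be", "BOM (BE)"
    else:
        codec, bom = "utf-16-be", None  # assumption: big-endian without a BOM

    rows, pos = [], 0
    if bom:
        rows.append((0, data[:2], bom))
        pos = 2

    # Strict decoding raises UnicodeDecodeError on malformed input.
    for index, ch in enumerate(data[pos:].decode(codec)):
        raw = ch.encode(codec)  # 2 bytes, or 4 for a surrogate pair
        rows.append((pos, raw, f".[{index}] U+{ord(ch):04X}"))
        pos += len(raw)
    return rows

for offset, raw, label in annotate_utf16(bytes.fromhex("fffe65000103")):
    print(f"0x{offset:x} {raw.hex(' ')} {label}")
```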
Would normalization and other Unicode operations be functions that work on code point arrays or strings? Both?
Unicode normalization is not available in jq either (see https://github.com/jqlang/jq/issues/2553), but there is a Go library. I think applying it to strings is enough, given that jq/fq properly treat Unicode strings as sequences of code points and combine an escaped surrogate pair into a single code point (unlike, for instance, JavaScript, which counts UTF-16 code units, or Python, which keeps the two escaped surrogates as separate code points):
$ fq -cn '"\uD83D\uDE00"|length' # surrogate pair for U+1F600 (😀)
1
$ <<<'console.log("\uD83D\uDE00".length)' node
2
$ <<<'print(len("\uD83D\uDE00"))' python
2
There might be an easier way to convert between UTF-16 code units (two bytes) and code points:
$ fq -cn '[0xD83D,0xDE00]|map(tobytes|explode)|flatten|from_utf16be|explode'
[128512] # U+1F600
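For reference, the underlying arithmetic is small: a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF) contributes 10 bits each on top of 0x10000. A Python sketch, with a made-up function name, that leaves unpaired surrogates untouched instead of reporting them:

```python
def code_units_to_code_points(units):
    """Combine UTF-16 code units into code points (lone surrogates pass through)."""
    out, i = [], 0
    while i < len(units):
        unit = units[i]
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if 0xD800 <= unit <= 0xDBFF and has_low:
            # 10 bits from the high surrogate, 10 bits from the low surrogate
            out.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(unit)
            i += 1
    return out

print(code_units_to_code_points([0xD83D, 0xDE00]))  # [128512]
```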
I will answer at more length tomorrow. I got stuck coding the prototype and it got late :) It's in this branch: https://github.com/wader/fq/tree/unicode-form
$ go run . -cn '[0x65, 0xcc, 0x81] | from_utf8 | ., _unicode_form({form: "nfc"}) | explode'
[101,769]
[233]
# currently only does utf8
$ go run . -cn '[0x65, 0xcc, 0x81] | utf'
│00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f│0123456789abcdef│.[0:2]: (utf)
0x0│65 │e │ [0]: "e" (101) (U+0065 LATIN SMALL LETTER E)
0x0│ cc 81│ │ ..│ │ [1]: "́" (769) (U+0301 COMBINING ACUTE ACCENT)
And yes, sadly the documentation is lacking quite a bit :( I hope I can get some motivation to work on it.