UTF-16 Plans
I know the docs say you have no plans for UTF-16 support, but that doc is from 4 years ago. Is there any more interest in UTF-16 now, or is not supporting it still the plan?
@sakgoyal in my experience, UTF-8 is adequate and seems to be quite standard. And Go (Miller's current implementation language) is solidly UTF-8 in terms of its built-in support -- although there is an add-on https://pkg.go.dev/unicode/utf16 package.
I'm not aware of UTF-16 demand -- can you tell me about your needs?
Looking back and trying to diagnose the issue, I'm no longer sure that UTF-16 is the problem. I think Miller is just handling UTF incorrectly somewhere.
what I am doing:
mlr --csv unsparsify *.csv >combined.csv
I am trying to merge a lot of CSV files whose columns do not match.
Some of the input files contain the text: Türkiye
When I run the command, Miller outputs a file with this instead: T├╝rkiye
Info: Windows 10 64-bit. Installed Miller from winget (mlr 6.13.0).
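For reference, the garbled string above matches the pattern you get when UTF-8 bytes are displayed one byte at a time in a DOS code page. Here is a minimal Python sketch, assuming CP437. The thread does not confirm which code page or viewer was involved, so treat this as an illustration rather than a diagnosis:

```python
# Sketch: reproduce the garbled text by decoding UTF-8 bytes with a DOS code page.
# Assumption: the console or viewer used CP437; this is not confirmed in the thread.
original = "Türkiye"
utf8_bytes = original.encode("utf-8")   # ü becomes the two bytes 0xC3 0xBC
garbled = utf8_bytes.decode("cp437")    # each byte is rendered as one CP437 character
print(garbled)                          # T├╝rkiye

# Reversing the mistake recovers the original text.
recovered = garbled.encode("cp437").decode("utf-8")
print(recovered)                        # Türkiye
```

If the round trip works on your own garbled output, the underlying bytes were probably valid UTF-8 and only the display step was off.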
OK, it turns out the ü was in fact encoded as UTF-16, not UTF-8. Somehow, when I was creating the CSVs, writing the ü worked normally even though I told Python to encode it as UTF-8 (???).
I forced it to encode in UTF-16 and used pandas to merge the files instead of mlr, and that worked. Sorry for the confusion.
But that still means I would not have been able to do this without UTF-16 support.
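One quick way to check which encoding a file is really in is to look for a byte-order mark (BOM) at the start. The sketch below is a hypothetical helper, not part of Miller or pandas, and it can only identify files that actually carry a BOM:

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess an encoding from a leading byte-order mark (hypothetical helper)."""
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: could be plain UTF-8, a legacy code page, or BOM-less UTF-16.
    return "unknown"

# Python's "utf-16" codec writes a BOM; plain "utf-8" does not.
print(sniff_bom("Türkiye".encode("utf-16")))
print(sniff_bom("Türkiye".encode("utf-8")))
```

For a file on disk, reading the first four bytes in binary mode (`open(path, "rb").read(4)`) is enough input for this check.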
@sakgoyal Windows does indeed use UTF-16 and Miller does indeed not handle that, so there is definitely room for improvement on the Miller side here ...
You can work around this with something like
iconv -f UTF16 -t UTF8 *.csv | mlr --csv unsparsify | iconv -f UTF8 -t UTF16 >combined.csv
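If iconv is not available (for example on a stock Windows install), the same transcoding step can be sketched in Python. The helper names `transcode` and `transcode_dir` and the directory layout are assumptions for illustration. Each file is converted on its own, so every input keeps its own header line:

```python
from pathlib import Path

def transcode(src: Path, dst: Path, src_enc: str = "utf-16", dst_enc: str = "utf-8") -> None:
    """Rewrite one text file in a different encoding (hypothetical helper)."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src, "r", encoding=src_enc, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding=dst_enc, newline="") as fout:
        fout.write(text)

def transcode_dir(in_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every *.csv in in_dir from UTF-16 to UTF-8 copies in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(in_dir.glob("*.csv")):
        dst = out_dir / src.name
        transcode(src, dst)
        outputs.append(dst)
    return outputs
```

After calling something like `transcode_dir(Path("."), Path("utf8"))`, the original command can be run on the converted copies with `mlr --csv unsparsify utf8/*.csv >combined.csv`. If UTF-16 output is needed, `transcode` can convert `combined.csv` back by swapping the `src_enc` and `dst_enc` arguments.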