
UTF 16 Plans

Open sakgoyal opened this issue 4 months ago • 5 comments

I know the docs say you have no plans for UTF-16 support, but that doc is from 4 years ago. Is there any more interest in UTF-16 now, or is not supporting it still the case?

sakgoyal avatar Aug 12 '25 22:08 sakgoyal

@sakgoyal In my experience, UTF-8 is adequate and seems to be quite standard. And Go (Miller's current implementation language) is solidly UTF-8 in terms of its built-in support, although there is a https://pkg.go.dev/unicode/utf16 package in the standard library.

I'm not aware of UTF-16 demand -- can you tell me about your needs?

johnkerl avatar Aug 13 '25 13:08 johnkerl

Looking back and trying to diagnose the issue, I'm no longer sure that UTF-16 is the problem. I think Miller is just handling Unicode text incorrectly somewhere.

What I am doing:

mlr --csv unsparsify *.csv >combined.csv

I am trying to merge a lot of CSV files whose columns do not match.

Some of the input files contain the text Türkiye. When I run the command, Miller outputs a file with this instead: T├╝rkiye

Info: Windows 10 64-bit, Miller installed from winget, mlr 6.13.0
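The garbled text can be reproduced by hand, which may help narrow down where it comes from. The sketch below shows that T├╝rkiye is exactly what the UTF-8 bytes of Türkiye look like when they are decoded with the legacy code page 437. This suggests the bytes were UTF-8 at some stage and were misread afterwards. Which program did the misreading is an assumption; the thread does not confirm it.

```python
# Sketch: reproduce the garbled text by decoding UTF-8 bytes with an OEM code page.
# Assumption (not confirmed in the thread): something in the pipeline read
# UTF-8 output as code page 437.

original = "Türkiye"

# UTF-8 encodes "ü" (U+00FC) as the two bytes 0xC3 0xBC.
utf8_bytes = original.encode("utf-8")

# Code page 437 maps 0xC3 to "├" and 0xBC to "╝", so the two bytes
# of one character show up as two unrelated box-drawing characters.
garbled = utf8_bytes.decode("cp437")

# Undoing the wrong decode recovers the original text, which shows
# that no information was lost, only misinterpreted.
recovered = garbled.encode("cp437").decode("utf-8")

print(garbled == "T├╝rkiye")   # True
print(recovered == original)   # True
```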

sakgoyal avatar Aug 14 '25 22:08 sakgoyal

OK, it turns out the ü was in fact encoded as UTF-16, not UTF-8. Somehow, when I was creating the CSVs, writing the ü worked normally even when I told Python to encode it as UTF-8 (???).

I forced it to encode as UTF-16 and used pandas to merge the files instead of mlr, and that worked. Sorry for the confusion.

But that still means I would not have been able to do this with Miller without UTF-16 support.
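When it is unclear which encoding a CSV file actually uses, checking for a byte-order mark (BOM) at the start of the file is a quick first test. The following is a minimal sketch, not a full detector: `guess_from_bom` and `sniff_encoding` are hypothetical helper names, and a file without a BOM simply yields `None`, which does not prove it is UTF-8.

```python
import codecs
from typing import Optional

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_from_bom(head: bytes) -> Optional[str]:
    """Return a Python codec name if `head` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def sniff_encoding(path: str) -> Optional[str]:
    """Read the first four bytes of a file and guess its encoding from the BOM."""
    with open(path, "rb") as f:
        return guess_from_bom(f.read(4))
```

The returned names are Python codec names that consume the BOM on decode, so the result can be passed directly as the `encoding` argument of `open`.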

sakgoyal avatar Aug 15 '25 17:08 sakgoyal

@sakgoyal Windows does indeed use UTF-16, and Miller does indeed not handle that, so there is definitely room for improvement on the Miller side here ...

johnkerl avatar Aug 15 '25 23:08 johnkerl

You can work around this with something like iconv -f UTF-16 -t UTF-8 *.csv | mlr --csv unsparsify | iconv -f UTF-8 -t UTF-16 >combined.csv
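Where iconv is not available (for example on a stock Windows install), the input conversion can be sketched in Python instead. This is a minimal sketch under stated assumptions: the inputs are UTF-16 files that start with a BOM, and `transcode_to_utf8` and `transcode_dir` are hypothetical helper names. Each file is converted separately, so every output file keeps its own header line.

```python
from pathlib import Path


def transcode_to_utf8(src: Path, dst: Path, src_encoding: str = "utf-16") -> None:
    """Rewrite `src` as UTF-8 without a BOM, preserving line endings."""
    # newline="" disables newline translation in both directions, so CRLF
    # stays CRLF. The "utf-16" codec consumes the BOM while decoding.
    with open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)


def transcode_dir(in_dir: Path, out_dir: Path) -> list:
    """Convert every *.csv in `in_dir` into `out_dir`; return the new paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.csv")):
        dst = out_dir / src.name
        transcode_to_utf8(src, dst)
        written.append(dst)
    return written
```

After a call such as `transcode_dir(Path("."), Path("utf8"))`, the original command can be pointed at the converted copies, for example `mlr --csv unsparsify utf8/*.csv >combined.csv`.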

braun2morrow avatar Nov 27 '25 11:11 braun2morrow