Clarify support (or not) for character encodings other than UTF-8

Open sacundim opened this issue 9 years ago • 6 comments

The documentation in README.md doesn't explain xsv's support or policy for character encodings. I think it really ought to.

Looking through the code for xsv and the csv crate, it looks like there isn't a consistent policy:

  1. Most of the code reads rows with the byte_records() function.
  2. xsv search, however, uses the records() function, which interprets the data as UTF-8.
  3. There are a few places where the code calls str::from_utf8() on byte data.
  4. The select module uses String to represent field names, which is UTF-8. What happens when you try to xsv select from a file that has Latin-1 field names?
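
To make the question in item 4 concrete, here is a minimal sketch using only the standard library. It is not xsv's actual code: `find_column_utf8` and `find_column_bytes` are hypothetical helpers. It assumes a Latin-1 header `café`, whose last byte is 0xE9 and which is therefore not valid UTF-8.

```rust
/// Hypothetical lookup that only matches headers that decode as UTF-8.
/// Headers that fail to decode are skipped, so they can never be selected.
fn find_column_utf8(headers: &[&[u8]], name: &str) -> Option<usize> {
    headers
        .iter()
        .position(|h| std::str::from_utf8(h).map_or(false, |s| s == name))
}

/// Hypothetical lookup that compares raw bytes and never decodes.
fn find_column_bytes(headers: &[&[u8]], name: &[u8]) -> Option<usize> {
    headers.iter().position(|h| *h == name)
}
```

With the headers `[b"id", b"caf\xE9"]`, `find_column_utf8(&headers, "café")` returns `None`, because the UTF-8 name encodes `é` as two bytes (0xC3 0xA9) and the Latin-1 header does not decode. `find_column_bytes(&headers, b"caf\xE9")` returns `Some(1)`.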

sacundim avatar Sep 15 '16 09:09 sacundim

I think my intention was to support text encodings that are "ASCII compatible," which should include Latin-1. For example, in almost all cases where str::from_utf8 is used, there is an actual fallback that runs with just the raw bytes. So there shouldn't be any places where, say, a true latin-1 encoding would be a problem.
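
The fallback pattern described above might look roughly like the following. This is a sketch, not code taken from xsv; `lowercase_field` is a hypothetical helper chosen only to show the shape of "try UTF-8 first, otherwise work on the raw bytes."

```rust
use std::str;

/// Hypothetical helper: lowercase a field. If the field is valid UTF-8,
/// use full Unicode lowercasing; otherwise fall back to lowercasing only
/// the ASCII letters and leave every other byte untouched.
fn lowercase_field(field: &[u8]) -> Vec<u8> {
    match str::from_utf8(field) {
        Ok(s) => s.to_lowercase().into_bytes(),
        Err(_) => field.to_ascii_lowercase(),
    }
}
```

For the UTF-8 input `"CAFÉ"` this yields the bytes of `"café"`. For the Latin-1 input `b"CAF\xC9"` the decode fails, so only the ASCII letters change and the result is `b"caf\xC9"`. The non-ASCII byte passes through unmodified rather than causing an error.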

Of course, you did pick out a few! In particular:

  1. Field name selection does appear to be limited to utf-8. Fixing that probably means moving the parser to &[u8] instead of &str.
  2. Searching via regex required &str at the time I wrote the code, but we can switch to byte-based regexes. (The search pattern must still be UTF-8, but one can search for arbitrary bytes with hex escapes. That isn't particularly ideal, but does make latin-1 support possible...)
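
The hex-escape idea in item 2 can be illustrated without a regex engine. The sketch below is a plain literal search, not a regex and not xsv's implementation; `parse_pattern` and `contains_bytes` are hypothetical names. The pattern itself is ordinary UTF-8 text, and `\xNN` stands for one arbitrary byte.

```rust
/// Hypothetical parser: turn a UTF-8 pattern into raw bytes, treating
/// `\xNN` (exactly two hex digits) as one arbitrary byte. Any other use
/// of a backslash, or a malformed escape, yields None.
fn parse_pattern(pattern: &str) -> Option<Vec<u8>> {
    let bytes = pattern.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\\' {
            if bytes.get(i + 1) != Some(&b'x') {
                return None;
            }
            // `get` returns None if the range is out of bounds.
            let hex = pattern.get(i + 2..i + 4)?;
            if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
                return None;
            }
            out.push(u8::from_str_radix(hex, 16).ok()?);
            i += 4;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    Some(out)
}

/// Literal substring search over raw bytes (no decoding involved).
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}
```

Here `parse_pattern(r"caf\xE9")` produces the four bytes `c a f 0xE9`, which match a Latin-1 row such as `b"1,caf\xE9,3"` but not the UTF-8 row `"1,café,3"`, where `é` is stored as 0xC3 0xA9.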

BurntSushi avatar Sep 15 '16 15:09 BurntSushi

IMO xsv should be UTF-8 first:

  • supporting other charsets is not really a requirement, as you can always convert from anything else to Unicode very quickly... but the reverse is not true
  • the whole Rust ecosystem is very UTF-8-centric for good reasons, and the performance of UTF-8 regexes is stellar, as you very well know ;-)
  • latin1 is dying, at least on the web

but I guess you had specific motivations for latin1 support?
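
The asymmetry in the first bullet can be sketched for the Latin-1 case using only the standard library. This assumes true ISO-8859-1 (not Windows-1252), where each byte maps to the Unicode code point with the same value; `latin1_to_string` and `string_to_latin1` are hypothetical helper names.

```rust
/// Decode ISO-8859-1: every byte maps to the code point with the same
/// value, so this direction cannot fail.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Encode to ISO-8859-1. Returns None if any character is above U+00FF,
/// since such characters have no Latin-1 representation.
fn string_to_latin1(s: &str) -> Option<Vec<u8>> {
    s.chars()
        .map(|c| {
            let cp = c as u32;
            if cp <= 0xFF { Some(cp as u8) } else { None }
        })
        .collect()
}
```

Decoding `b"caf\xE9"` gives `"café"`, and encoding `"café"` round-trips back to the same four bytes. Encoding `"€"` (U+20AC) returns `None`, which is the "reverse is not true" part.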

eddy-geek avatar Dec 12 '16 15:12 eddy-geek

@eddy-geek I don't really understand what's motivating your comment. CSV itself doesn't have a specified character encoding, and most CSV parsers are written to be ASCII compatible. ASCII compatibility is the goal, and as a result, encodings like latin-1 wind up being supported. This is important because CSV data is often quite messy, and there's nothing worse than failing to read CSV data because of a character encoding issue.

This issue is basically "fix a few places in xsv where UTF-8 is assumed." That's it. Nothing more.

BurntSushi avatar Dec 12 '16 16:12 BurntSushi

Ok I see, sorry for the noise

eddy-geek avatar Dec 12 '16 17:12 eddy-geek

What about UTF-16, UTF-16BE, UTF-16LE ?

velocityzen avatar Mar 26 '24 21:03 velocityzen

Not supported.
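
For context, UTF-16 is not ASCII compatible: the text `a,b` in UTF-16LE is the bytes `61 00 2C 00 62 00`, so a byte-oriented parser sees NUL bytes between the characters. One possible workaround, offered here only as a sketch and not as anything xsv provides, is to transcode to UTF-8 before parsing. `utf16le_to_string` is a hypothetical helper built on the standard library's `String::from_utf16`.

```rust
/// Decode UTF-16LE bytes into a String. Returns None if the length is odd
/// or the data contains invalid surrogates. A leading byte order mark
/// (U+FEFF), if present, is stripped.
fn utf16le_to_string(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    let s = String::from_utf16(&units).ok()?;
    Some(s.strip_prefix('\u{FEFF}').unwrap_or(&s).to_string())
}
```

The resulting `String` is UTF-8 and can be handed to any ASCII-compatible CSV reader. For UTF-16BE the same sketch would use `u16::from_be_bytes` instead.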

BurntSushi avatar Mar 26 '24 23:03 BurntSushi