Clarify support (or not) for character encodings other than UTF-8
The documentation in the README.md doesn't explain what xsv's support or policy for character encodings is. I think it really ought to.
Looking through the code for xsv and the csv crate, it looks like there isn't a consistent policy:
- Most of the code reads rows with the `byte_records()` function.
- `xsv search`, however, uses the `records()` function, which interprets the data as UTF-8.
- There are a few places where the code calls `str::from_utf8()` on byte data.
- The `select` module uses `String` to represent field names, which is UTF-8. What happens when you try to `xsv select` from a file that has Latin-1 field names?
I think my intention was to support text encodings that are "ASCII compatible," which should include Latin-1. For example, in almost all cases where `str::from_utf8` is used, there is an actual fallback that runs with just the raw bytes. So there shouldn't be any places where, say, a true latin-1 encoding would be a problem.
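As a rough illustration of that fallback pattern (not code taken from xsv), the sketch below tries `str::from_utf8` first and drops to raw-byte handling when the field is not valid UTF-8. The `trim_field` helper is hypothetical and uses only the standard library.

```rust
use std::str;

/// Hypothetical helper (not from xsv): trim surrounding whitespace from a field.
fn trim_field(field: &[u8]) -> &[u8] {
    match str::from_utf8(field) {
        // Valid UTF-8: use the Unicode-aware trim.
        Ok(s) => s.trim().as_bytes(),
        // Not UTF-8 (e.g. Latin-1): fall back to trimming ASCII whitespace
        // directly on the raw bytes.
        Err(_) => {
            let start = field
                .iter()
                .position(|b| !b.is_ascii_whitespace())
                .unwrap_or(field.len());
            let end = field
                .iter()
                .rposition(|b| !b.is_ascii_whitespace())
                .map_or(start, |i| i + 1);
            &field[start..end]
        }
    }
}

fn main() {
    // " café " in Latin-1: é is the single byte 0xE9, which is not valid UTF-8 here.
    let latin1 = b" caf\xE9 ";
    println!("{:?}", trim_field(latin1));

    // The same text in UTF-8 takes the `Ok` branch.
    println!("{:?}", trim_field(" café ".as_bytes()));
}
```

Either branch returns a sub-slice of the input, so the caller never needs to know which encoding the field was in.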
Of course, you did pick out a few! In particular:
- Field name selection does appear to be limited to utf-8. Fixing that probably means moving the parser to `&[u8]` instead of `&str`.
- Searching via regex required `&str` at the time I wrote the code, but we can switch to byte based regexes. (The search pattern must still be UTF-8, but, one can search for arbitrary bytes with hex escapes. That isn't particularly ideal, but does make latin-1 support possible...)
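To make the first point concrete, here is a minimal sketch of what matching field names on `&[u8]` rather than `&str` could look like. The `find_field` function and the header data are hypothetical and do not reflect xsv's actual selector parser; the point is only that a byte-wise comparison never requires the header to be valid UTF-8.

```rust
/// Hypothetical helper: index of the first header equal to `name`,
/// compared byte for byte with no UTF-8 decoding.
fn find_field(headers: &[Vec<u8>], name: &[u8]) -> Option<usize> {
    headers.iter().position(|h| h.as_slice() == name)
}

fn main() {
    // "prénom" encoded as Latin-1: é is the single byte 0xE9.
    let headers: Vec<Vec<u8>> = vec![b"id".to_vec(), b"pr\xE9nom".to_vec()];

    // A `String`-based representation cannot hold this header at all.
    assert!(String::from_utf8(headers[1].clone()).is_err());

    // The byte-based lookup still finds it.
    println!("{:?}", find_field(&headers, b"pr\xE9nom"));
}
```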
IMO xsv should be UTF-8 first:
- supporting other charsets is not really a requirement as you can always convert from anything else to unicode very quickly... but the reverse is not true
- all the rust ecosystem is very UTF8-centric for good reasons, and the performance of UTF8 regexes is stellar as you very well know ;-)
- latin1 is dying at least on the web
but I guess you had specific motivations for latin1 support?
@eddy-geek I don't really understand what's motivating your comment. CSV itself doesn't have a specified character encoding, and most CSV parsers are written to be ASCII compatible. ASCII compatibility is the goal, and as a result, encodings like latin-1 wind up being supported. This is important because CSV data is often quite messy, and there's nothing worse than failing to read CSV data because of a character encoding issue.
This issue is basically "fix a few places in xsv where UTF-8 is assumed." That's it. Nothing more.
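A small sketch of why ASCII compatibility is enough for an encoding like latin-1: the structural bytes (comma, quote, newline) are all ASCII, and in latin-1 every non-ASCII character is a single byte at or above 0x80, so it can never be mistaken for a delimiter. The `split_fields` function is hypothetical and deliberately naive (no quoting support); it is not how the csv crate parses.

```rust
/// Hypothetical, deliberately naive splitter: breaks a line on the ASCII
/// comma byte and treats everything else as opaque bytes.
fn split_fields(line: &[u8]) -> Vec<&[u8]> {
    line.split(|&b| b == b',').collect()
}

fn main() {
    // "café,naïve" in Latin-1 (é = 0xE9, ï = 0xEF); not valid UTF-8.
    let line = b"caf\xE9,na\xEFve";
    for field in split_fields(line) {
        println!("{:?}", field);
    }
}
```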
Ok I see, sorry for the noise
What about UTF-16, UTF-16BE, UTF-16LE ?
Not supported.