grist-core
Support importing UTF-8 CSV files
Grist doesn't handle UTF-8 CSV files properly.
Minimal example
- Copy the following into Google sheets:
Type,Status,Note
Payment,Complete,π°
- File->Download->CSV. You get a UTF-8 csv file.
- Import it into Grist. You get:
Type,Status,Note
Payment,Complete,ΔΕΈβΒ°ΔΕΈΛΕ
- Export it as CSV. You get the same mangled result, encoded as UTF-8. The CSV import code seems to assume Latin-1 encoding.
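The mangling in the steps above can be reproduced outside Grist. The following is a minimal sketch, not Grist's import code: the helper names `mangle` and `unmangle` are hypothetical, and the sample character (U+1F4B0) is an assumed stand-in for the emoji in the example. It encodes text as UTF-8, decodes the bytes with the wrong single-byte codec, and then shows that the damage is reversible as long as the wrong codec is known.

```python
# Illustration only; names and sample data are assumptions, not Grist code.
# U+1F4B0 is written as an escape so the file itself stays ASCII-safe.
SAMPLE = "Type,Status,Note\nPayment,Complete,\U0001F4B0\n"


def mangle(text: str, wrong_encoding: str) -> str:
    """Encode as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def unmangle(text: str, wrong_encoding: str) -> str:
    """Undo mangle(): recover the original bytes, then decode as UTF-8."""
    return text.encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    mangled = mangle(SAMPLE, "cp1254")  # cp1254 is Python's name for Windows-1254
    print(ascii(mangled))               # escaped form of the mangled text
    print(unmangle(mangled, "cp1254") == SAMPLE)
```

Exporting the mangled text as UTF-8, as in the last step, bakes the wrong characters into the file, which is why the exported CSV shows the same garbage.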
Expectation
Grist should either correctly guess the UTF-8 encoding, or allow the user to select the desired encoding from a dropdown (or both).
This is quite important for multi-language support, but my personal use-case is importing CSV files from Venmo, where it is standard to use emojis, sometimes exclusively, in descriptions of transactions.
I've run into this as well and would love to see this fixed, ideally with guessing the encoding. π
Grist uses chardet to guess the encoding:
import chardet
s = """Type,Status,Note
Payment,Complete,π°
""".encode('utf-8')
print(chardet.detect_all(s))
[{'encoding': 'Windows-1254', 'confidence': 0.5465407561688055, 'language': 'Turkish'},
{'encoding': 'Windows-1252', 'confidence': 0.33692307692307694, 'language': ''}]
Unfortunately there's no way to guess the encoding perfectly.
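To illustrate why guessing is inherently ambiguous, here is a small standard-library sketch (the function name and the candidate list are my own choices, not anything from Grist or chardet). It lists every candidate encoding that decodes a byte string without error. Single-byte encodings accept almost any input, so a clean decode alone cannot identify the encoding.

```python
from typing import Iterable, List

# Assumed candidate list for illustration; any Python codec names work here.
CANDIDATES = ["utf-8", "cp1252", "cp1254", "latin-1", "mac_roman"]


def encodings_that_decode(data: bytes,
                          candidates: Iterable[str] = CANDIDATES) -> List[str]:
    """Return the candidate encodings that decode `data` without raising."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


if __name__ == "__main__":
    # U+1F4B0 is an assumed stand-in for the emoji in the example above.
    data = "Payment,Complete,\U0001F4B0".encode("utf-8")
    print(encodings_that_decode(data))
```

The useful asymmetry is that UTF-8 is strict: bytes produced by a legacy single-byte encoding rarely form valid UTF-8, while valid UTF-8 almost always also decodes (wrongly) under the legacy encodings.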
"or allow the user to select the desired encoding from a dropdown"
This should be quite doable.
Actually, I have a fix in hand and almost ready. There is a genuine bug that's causing wrong guesses in many cases.
Great to hear @dsagal.
In general, Grist should be able to round-trip to and from CSV without data loss. UTF-8 seems like a better default unless there's an indication that the encoding is something else.
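A "UTF-8 first" policy like the one suggested here could look roughly like the sketch below. This is an assumption about one possible approach, not how Grist implements import. The function `decode_csv_bytes`, its `guess` parameter, and the `latin-1` last resort are all hypothetical. The idea is to try a strict UTF-8 decode and only consult a detector when that fails.

```python
from typing import Callable, Optional, Tuple


def decode_csv_bytes(
    data: bytes,
    guess: Optional[Callable[[bytes], Optional[str]]] = None,
    last_resort: str = "latin-1",
) -> Tuple[str, str]:
    """Decode CSV bytes, preferring UTF-8. Returns (text, encoding_used).

    `guess` is an optional detector supplied by the caller. For example, it
    could wrap chardet as `lambda b: chardet.detect(b)["encoding"]`.
    """
    try:
        # "utf-8-sig" is strict UTF-8 that also strips a leading BOM.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass

    if guess is not None:
        name = guess(data)
        if name:
            try:
                return data.decode(name), name
            except (UnicodeDecodeError, LookupError):
                pass  # bad guess or unknown codec name: fall through

    # latin-1 maps every byte to a character, so this cannot fail.
    return data.decode(last_resort), last_resort


if __name__ == "__main__":
    text, used = decode_csv_bytes("Payment,Complete,\U0001F4B0".encode("utf-8"))
    print(used)
```

A user-selected encoding from Import Options would still need to override this, since short legacy-encoded files that happen to be pure ASCII are indistinguishable from UTF-8.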
Hi @dsagal, any chance you have an update on this?
Edit: Oh, looks like you can now set the encoding to "utf-8" manually in "Import Options"! That's great. It still seems to be a bug that it usually guesses the wrong encoding.
Actually, a fix landed a while ago, in d5a4605d2a3d04e0e87ede334d9b9d0e54b13f08. I am sorry that I failed to follow up on this issue to say so.
There are two parts to the fix. One is that it fixes the real bug I knew about, which used to happen for larger files. For the small example you shared, it still looks too much like Turkish text to the Python chardet module, so that's still the first guess.
The second part of the fix is that the Import Options now allow you to change the encoding, e.g. to utf-8.
Let me know please if this works for you. If so, perhaps this issue can be closed.
I don't know how chardet makes its guesses, but I can say that I see Grist consistently picking the wrong encoding for emoji-laden utf-8.
In practice I often see it picking little-used encodings like Windows-1254 or macroman. E.g. this UTF-8 csv file causes it to pick the former, then spit out the error message "Using encoding Windows-1254, encountered 2 errors. Use Import Options to change":
Description_
ππ’. (π)
π° π (splitwise.com)
π΅ πΈ (splitwise.com)
π
The same file opened in VSCode, with files.autoGuessEncoding set to true, correctly guesses utf-8.
One helpful change would be to have Grist persist the encoding selection rather than forcing the user to change it on every import. Users are probably not often going to be mixing encodings. In truth, I think just defaulting to utf-8, by far the most popular encoding, would work better for most users than trying to guess. The few who want other encodings can set the import option.