grist-core

Support importing UTF-8 CSV files

Open samkhal opened this issue 1 year ago • 7 comments

Grist doesn't handle UTF-8 CSV files properly.

Minimal example

  • Copy the following into Google sheets:
Type,Status,Note
Payment,Complete,💰
  • File->Download->CSV. You get a UTF-8 csv file.
  • Import it into Grist. You get:
Type,Status,Note
Payment,Complete,ğŸ’°ğŸ˜œ
  • Export it as CSV. You get the same mangled result, encoded as UTF-8. The CSV import code seems to assume Latin-1 encoding.
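To see how this kind of garbling arises, here is a minimal sketch. It assumes the importer decoded the UTF-8 bytes as Windows-1254 (the encoding named later in this thread); the variable names are illustrative and this is not Grist's import code.

```python
# Illustrative only: reproduce the mangling by decoding UTF-8 bytes with the wrong codec.
original = "Payment,Complete,💰"
utf8_bytes = original.encode("utf-8")

# Assumption: the importer picked Windows-1254 (Python codec name "cp1254").
mangled = utf8_bytes.decode("cp1254")
print(mangled)

# Every byte here maps to a character in cp1254, so reversing the wrong
# decode gives back the original bytes and the text can be recovered.
recovered = mangled.encode("cp1254").decode("utf-8")
print(recovered == original)
```

Recovery like this only works while the mangled text has not been further altered, which is one reason a correct guess at import time matters.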

Expectation

Grist should either correctly guess the UTF-8 encoding, or allow the user to select the desired encoding from a dropdown (or both).

This is quite important for multi-language support, but my personal use-case is importing CSV files from Venmo, where it is standard to use emojis, sometimes exclusively, in descriptions of transactions.

samkhal avatar Aug 18 '23 16:08 samkhal

I've run into this as well and would love to see this fixed, ideally with guessing the encoding. 👍

anaisconce avatar Aug 18 '23 17:08 anaisconce

Grist uses chardet to guess the encoding:

import chardet

s = """Type,Status,Note
Payment,Complete,💰
""".encode('utf-8')

print(chardet.detect_all(s))
[{'encoding': 'Windows-1254', 'confidence': 0.5465407561688055, 'language': 'Turkish'},
 {'encoding': 'Windows-1252', 'confidence': 0.33692307692307694, 'language': ''}]

Unfortunately there's no way to guess the encoding perfectly.
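A small sketch of why guessing is inherently ambiguous: the same bytes decode without error under several codecs, so a detector has to rely on statistics rather than validity alone. The candidate list below is an arbitrary choice for illustration, not what chardet checks internally.

```python
# Illustrative only: the same bytes are valid input for several codecs.
data = "Type,Status,Note\nPayment,Complete,💰\n".encode("utf-8")

# Assumption: a hand-picked list of codecs to try, for demonstration.
candidates = ["utf-8", "cp1254", "cp1252", "latin-1"]
decodable = []
for name in candidates:
    try:
        data.decode(name)  # strict decoding; raises on invalid bytes
        decodable.append(name)
    except UnicodeDecodeError:
        pass

print(decodable)
```

All four codecs accept these bytes, but only UTF-8 yields the intended text.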

> or allow the user to select the desired encoding from a dropdown

This should be quite doable.

alexmojaki avatar Aug 18 '23 17:08 alexmojaki

Actually, I have a fix in hand and almost ready. There is a genuine bug that's causing wrong guesses in many cases.

dsagal avatar Aug 18 '23 18:08 dsagal

Great to hear @dsagal.

In general, Grist should be able to round-trip to and from CSV without data loss. UTF-8 seems like a better default unless there's an indication that the encoding is something else.
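One way to express "UTF-8 unless something indicates otherwise" is to attempt a strict UTF-8 decode first and only fall back when it fails. This is a sketch under stated assumptions, not Grist's implementation: the function name `decode_csv_bytes` and the `cp1252` fallback are hypothetical, and a real importer could call a detector such as chardet in the fallback branch instead.

```python
from typing import Tuple


def decode_csv_bytes(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used), preferring strict UTF-8.

    Hypothetical helper for illustration; the fallback codec is an assumption.
    """
    try:
        # Strict decoding: any byte sequence that is not valid UTF-8 raises.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so use the fallback; undecodable bytes become U+FFFD.
        return data.decode(fallback, errors="replace"), fallback


print(decode_csv_bytes("Payment,Complete,💰".encode("utf-8")))
print(decode_csv_bytes("café".encode("cp1252")))
```

The approach leans on the fact that text in a legacy single-byte encoding containing non-ASCII characters rarely happens to be valid UTF-8, so a successful strict decode is a strong signal.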

samkhal avatar Aug 20 '23 18:08 samkhal

Hi @dsagal, any chance you have an update on this?

Edit: Oh, looks like you can now set the encoding to "utf-8" manually in "import options"! That's great. It still seems to be a bug that it usually guesses the wrong encoding.

samkhal avatar Jan 07 '24 02:01 samkhal

Actually, a fix landed a while ago, in d5a4605d2a3d04e0e87ede334d9b9d0e54b13f08. I am sorry that I failed to follow up on this issue to say so.

There are two parts to the fix. One is that it fixes the real bug I knew about, which used to happen for larger files. For the small example you shared, it still looks too much like Turkish text to the Python chardet module, so that's still the first guess.

The second part of the fix is that the Import Options now allow you to change the encoding, e.g. to utf-8:

[Screenshot 2024-01-07 at 12 06 21 AM] [Screenshot 2024-01-07 at 12 07 29 AM]

Please let me know if this works for you. If so, perhaps this issue can be closed.

dsagal avatar Jan 07 '24 05:01 dsagal

I don't know how chardet makes its guesses, but I can say that I see Grist consistently picking the wrong encoding for emoji-laden utf-8.

In practice I often see it picking little-used encodings like Windows-1254 or macroman. E.g. this UTF-8 csv file causes it to pick the former, then spit out the error message "Using encoding Windows-1254, encountered 2 errors. Use Import Options to change":

Description_
🚌😒. (😛)
💰 🔙 (splitwise.com)
💵 💸 (splitwise.com)
πŸ•
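For context on where decode errors like these can come from: some UTF-8 bytes have no mapping in Windows-1254 at all. The sketch below counts such bytes by decoding with `errors="replace"`. How Grist actually counts errors is not shown in this thread, so the helper `count_decode_errors` and the pizza-emoji sample are purely hypothetical.

```python
def count_decode_errors(data: bytes, encoding: str) -> int:
    """Count bytes (or invalid sequences) the codec could not decode.

    Hypothetical helper: undecodable input is replaced with U+FFFD, which
    is then counted. This overcounts if the text legitimately contains U+FFFD.
    """
    decoded = data.decode(encoding, errors="replace")
    return decoded.count("\ufffd")


# Assumption: a sample emoji whose UTF-8 form includes byte 0x8D,
# which is undefined in cp1254.
sample = "🍕".encode("utf-8")
print(count_decode_errors(sample, "cp1254"))
print(count_decode_errors(sample, "utf-8"))
```

A non-zero count under one codec and zero under UTF-8 is the kind of evidence that could favor UTF-8 as the guess.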

The same file opened in VSCode, with files.autoGuessEncoding set to true, correctly guesses utf-8.

One helpful change would be to have Grist persist the encoding selection rather than forcing the user to change it on every import. Users are probably not often going to be mixing encodings. In truth, I think just defaulting to utf-8, by far the most popular encoding, would work better for most users than trying to guess. The few who want other encodings can set the import option.

samkhal avatar Jan 07 '24 07:01 samkhal