zed Reading non-UTF-8 CSV (e.g., ISO-8859-1)

Open philrz opened this issue 2 years ago • 1 comments

Repro is with Zed commit 963863b.

A community zync user originally surfaced this issue. The inquiry in their own words:

While doing some data exploration, I noticed that a CSV file can be imported in a non-UTF-8 encoding.

Should have been Béarn, Bécancour, etc. This particular CSV file was encoded in ISO-8859-1, commonly used in Latin language legacy apps.

They also proposed a possible solution approach:

It would be nice to have a decode function in Zed. Something like decode(myfield,"ISO-8859-1") which would convert the input string to UTF-8 and simply return it. Could it be implemented using a go library like https://pkg.go.dev/golang.org/x/text/encoding/charmap or similar?
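To make the proposal concrete, here is a minimal sketch of what such a decode step could do for ISO-8859-1 specifically. The name `decodeLatin1` is hypothetical and is not an existing Zed function. The sketch uses only the Go standard library and relies on the fact that ISO-8859-1 maps each byte value onto the Unicode code point with the same number; other encodings would need a lookup table such as the ones in the `charmap` package the user linked.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeLatin1 is a hypothetical helper (not part of Zed). It converts
// ISO-8859-1 bytes to a UTF-8 string. Each ISO-8859-1 byte value equals the
// Unicode code point it represents, so no lookup table is needed here.
func decodeLatin1(b []byte) string {
	var sb strings.Builder
	sb.Grow(len(b) * 2) // each input byte becomes at most two UTF-8 bytes
	for _, c := range b {
		sb.WriteRune(rune(c))
	}
	return sb.String()
}

func main() {
	// "Béarn" encoded as ISO-8859-1: é is the single byte 0xE9.
	raw := []byte{'B', 0xE9, 'a', 'r', 'n'}
	fmt.Println(decodeLatin1(raw)) // Béarn
}
```

One caveat: this only works on the original bytes. If a value has already been stored with the bad bytes replaced by U+FFFD, the original characters can no longer be recovered.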

This user could not share their original test data for confidentiality reasons. However, they helpfully pointed to the site https://www.freeformatter.com/convert-file-encoding.html, which can convert files to the ISO-8859-1 encoding. They also pointed to some UTF-8-encoded test data at https://www.donneesquebec.ca/recherche/dataset/profil-financier-des-municipalites-locales-edition-2022/resource/bd45fe94-0c58-44e5-81e8-f854c18c7acb that we could run through this converter. Therefore, for repro I've attached a CSV in UTF-8 followed by its equivalent converted to ISO-8859-1.

Format	File
UTF-8	pf-mun-2022-2022.csv
ISO-8859-1	iso-8859-1-pf-mun-2022-2022.csv

Here are screenshots of each file fully loaded into Zui Insiders, with the latter showing the problem.

And here's how each looks at the CLI.

$ zq -version
Version: v1.5.0

$ zq -z 'head 1 | cut nom_mun' pf-mun-2022-2022.csv
{nom_mun:"Îles-de-la-Madeleine"}

$ zq -z 'head 1 | cut nom_mun' iso-8859-1-pf-mun-2022-2022.csv 
{nom_mun:"\ufffdles-de-la-Madeleine"}
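The `\ufffd` in the second output is the Unicode replacement character U+FFFD. The sketch below shows how Go's standard library treats the same leading bytes. In ISO-8859-1, `Î` is the single byte 0xCE, which in UTF-8 would start a two-byte sequence, and the following `l` is not a valid continuation byte. This illustrates general Go behavior and is not a claim about the exact code path inside `zq`. `replaceInvalid` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// replaceInvalid is a hypothetical helper: it swaps every run of bytes that
// is not valid UTF-8 for the replacement character U+FFFD.
func replaceInvalid(b []byte) string {
	return strings.ToValidUTF8(string(b), "\ufffd")
}

func main() {
	// "Îles" as ISO-8859-1 bytes: Î is the single byte 0xCE.
	latin1 := []byte{0xCE, 'l', 'e', 's'}

	fmt.Println(utf8.Valid(latin1)) // false

	// Decoding the first rune fails and consumes one byte.
	r, size := utf8.DecodeRune(latin1)
	fmt.Printf("%U %d\n", r, size) // U+FFFD 1

	// %+q prints non-ASCII runes as escapes, matching the zq output style.
	fmt.Printf("%+q\n", replaceInvalid(latin1)) // "\ufffdles"
}
```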

Some questions that come to mind when contemplating a solution:

Since the Zed data model speaks of UTF-8, should the tooling have been able to kick back with an error about the encoding rather than letting through garbled data?
The user's proposal of a function like decode(myfield,"ISO-8859-1") holds some appeal. However, @nwt pointed out that this kind of encoding is typically applied on a whole-file basis. Therefore, would it be appropriate to have some kind of flag in the reader that handles the entire CSV input with a specified encoding? (FWIW, the user no longer has the original CSV for their specific test data, so they may want the decode() function regardless, to apply to older data that's already made it into Zed formats.)
While the user's request is specific to ISO-8859-1, would we want to add this functionality in a general way so it can be applied to any number of alternate encodings, such as the ones listed at https://pkg.go.dev/golang.org/x/text/encoding/charmap?
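As a rough illustration of the first two questions, here is a sketch of a reader entry point that rejects input that is not valid UTF-8 unless an encoding is named, and transcodes the whole input when one is. Everything here is an assumption for discussion: `readCSV`, its `charset` parameter, and `errNotUTF8` are hypothetical names, not Zed's reader API. Only ISO-8859-1 is handled, and the input is buffered in memory for simplicity.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// errNotUTF8 is a hypothetical error for input that fails UTF-8 validation.
var errNotUTF8 = errors.New("input is not valid UTF-8")

// readCSV is a hypothetical entry point, not Zed's actual reader.
// With charset "" or "UTF-8" it refuses input that is not valid UTF-8.
// With charset "ISO-8859-1" it transcodes the whole input before parsing.
func readCSV(data []byte, charset string) ([][]string, error) {
	switch charset {
	case "", "UTF-8":
		if !utf8.Valid(data) {
			return nil, errNotUTF8
		}
	case "ISO-8859-1":
		// Each ISO-8859-1 byte equals its Unicode code point.
		var sb strings.Builder
		for _, c := range data {
			sb.WriteRune(rune(c))
		}
		data = []byte(sb.String())
	default:
		return nil, fmt.Errorf("unsupported charset %q", charset)
	}
	return csv.NewReader(bytes.NewReader(data)).ReadAll()
}

func main() {
	// A header and one row in ISO-8859-1; \xCE is the byte for Î.
	in := []byte("nom_mun\n\xCEles-de-la-Madeleine\n")

	_, err := readCSV(in, "")
	fmt.Println(err) // input is not valid UTF-8

	rows, err := readCSV(in, "ISO-8859-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(rows[1][0]) // Îles-de-la-Madeleine
}
```

For the third question, a general version could swap the ISO-8859-1 case for a table of decoders keyed by name, for example ones built from the `charmap` package linked above, and could wrap the input stream instead of buffering it.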

Jan 31 '23 00:01 philrz
