cite-classifications-wiki Different charset encodings for parquet output columns

Open iosonopersia opened this issue 3 years ago • 0 comments

Hi @Harshdeep1996! I recently discovered some annoying problems in the output dataset, which is stored in parquet format.

First problem

The 'metadata_file' column is stored as a byte array instead of a proper Python unicode str. At least, this is what I get when importing that column. Example: b"_1234.json" instead of "_1234.json"
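As a possible workaround on the reading side, here is a minimal sketch that normalizes such values to str. It assumes the bytes are UTF-8 encoded, which I have not verified for this dataset; `to_text` and `raw_values` are hypothetical names used only for illustration.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes.

    Assumption: the stored bytes are UTF-8; pass another encoding if not.
    """
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    # Already a str (or a missing value such as None): leave untouched.
    return value


# Hypothetical values, shaped like what the 'metadata_file' column returns.
raw_values = [b"_1234.json", "_5678.json", None]
decoded = [to_text(v) for v in raw_values]
print(decoded)  # ['_1234.json', '_5678.json', None]
```

If the column is loaded into a pandas DataFrame named df (an assumption about how the file is read), something like df["metadata_file"].map(to_text) should apply the same conversion.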

Second problem

In some columns (I'm not sure I have identified all of them), strings are stored with Unicode escape sequences inside. What I mean is that the bytes actually stored in the parquet file correspond to a string like 12\u201345, where the \u2013 sequence is not interpreted as an en dash character (U+2013) but as the literal six-character string ['\', 'u', '2', '0', '1', '3']. It should instead be stored as a proper Python unicode str, with the escape sequence already decoded.

From my findings, these are the columns affected by this problem (but I'm not 100% sure, I need your help here):

Authors
Chapter
Date
ID_list
Issue
Pages
Periodical
PublisherName
Title
Volume

I currently assume every other column to be stored as a proper Python unicode str.
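To illustrate the second problem, here is a rough sketch of how the literal escape sequences could be decoded after loading. It is not how the dataset is produced, just a reader-side workaround. `unescape_unicode` is a hypothetical helper; it only handles four-digit \uXXXX sequences and does not recombine surrogate pairs. A regex is used instead of the `unicode_escape` codec because that codec would mangle any non-ASCII characters already present in the string.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text):
    """Replace literal \\uXXXX sequences with the characters they name.

    Limitation: surrogate pairs are not recombined, and an escaped
    backslash followed by 'u' is not treated specially.
    """
    if not isinstance(text, str):
        return text  # e.g. None or a non-string value
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# The example from above: ten characters as stored, five once decoded.
stored = "12\\u201345"
fixed = unescape_unicode(stored)
print(len(stored), len(fixed))  # 10 5
```

Strings that contain no escape sequences pass through unchanged, so applying this to a column that is already correct should be harmless.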

Mar 03 '21 09:03 iosonopersia
