cite-classifications-wiki
Different charset encodings for parquet output columns
Hi @Harshdeep1996! I recently discovered some annoying problems in the output dataset, which is stored in the parquet format.
First problem
The 'metadata_file' column is stored as a byte array instead of a proper Python unicode str.
At least, this is what I get when importing that column.
Example: b"_1234.json" instead of "_1234.json"
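As a possible workaround on the reading side, here is a minimal sketch that decodes such values after import. It assumes the bytes are UTF-8 encoded, which I have not verified; the helper name to_text and the sample values are only for illustration.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a bytes-like object.

    The UTF-8 default is an assumption about how the column was written.
    """
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    return value  # already a str (or None): leave untouched


# Hypothetical values, mimicking what importing 'metadata_file' returns
raw_column = [b"_1234.json", "_5678.json", None]
fixed_column = [to_text(v) for v in raw_column]
```

With pandas, the same helper could be applied through something like df["metadata_file"].map(to_text), but fixing the column type where the parquet file is written would be the cleaner solution.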
Second problem
In some columns (I'm not sure I have identified all of them), strings are stored with unicode escape sequences inside.
What I mean is that the actual bytes stored persistently in the parquet file correspond to a unicode string like 12\u201345 (where the \u2013 sequence is not interpreted as an en-dash char but as the string composed of the following chars: ['\', 'u', '2', '0', '1', '3']). It should instead be stored as a proper Python unicode str.
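To make the difference concrete, here is a minimal sketch that reproduces the stored form and converts it back. The function name unescape_unicode is hypothetical, and the regex only handles 4-hex-digit \uXXXX sequences, which is an assumption about how the escapes appear in the data.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text):
    """Replace literal \\uXXXX sequences with the characters they name.

    Limitation: escaped surrogate pairs (characters outside the BMP)
    are not recombined by this sketch.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


stored = "12\\u201345"   # what is in the file: 10 chars, no dash
fixed = unescape_unicode(stored)

print(len(stored), len(fixed))
print(fixed == "12\u201345")
```

I used a regex rather than the built-in unicode_escape codec because that codec treats its input as Latin-1, so it would corrupt any non-ASCII characters that are already stored correctly in the same string.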
From my findings, these are the columns affected by this problem (but I'm not 100% sure, I need your help here):
- Authors
- Chapter
- Date
- ID_list
- Issue
- Pages
- Periodical
- PublisherName
- Title
- Volume
I currently assume every other column to be stored as a proper Python unicode str.
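To check that assumption, and the column list above, one option is to scan every column for literal escape sequences. The following is only a sketch: it assumes the rows have already been loaded as plain Python dicts, and the sample rows (including the URL column) are made up for illustration.

```python
import re

# Same pattern as a literal \uXXXX escape: backslash, 'u', four hex digits
ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")


def columns_with_literal_escapes(rows):
    """Return the sorted names of columns containing literal \\uXXXX text."""
    affected = set()
    for row in rows:
        for column, value in row.items():
            if isinstance(value, bytes):
                # Assumes UTF-8; undecodable bytes are replaced, not raised
                value = value.decode("utf-8", errors="replace")
            if isinstance(value, str) and ESCAPE.search(value):
                affected.add(column)
    return sorted(affected)


# Hypothetical sample rows, not taken from the real dataset
sample_rows = [
    {"Title": "A\\u2013B", "Pages": "12\\u201345", "URL": "http://example.org"},
    {"Title": "Plain", "Pages": "7", "URL": None},
]
affected = columns_with_literal_escapes(sample_rows)
print(affected)
```

A column that never shows up in such a scan is not proven clean, since it may simply contain no non-ASCII values in the rows inspected.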