ethereum-etl
Filter out ASCII characters not supported by BigQuery
BigQuery fails when trying to load a CSV with ASCII 0 with the following message:
Error: Bad character (ASCII 0) encountered.
We need to check what other characters are not supported in BigQuery and filter them out (see https://en.wikipedia.org/wiki/ASCII).
https://github.com/medvedev1088/ethereum-etl/blob/master/ethereumetl/jobs/export_tokens_job.py#L64
This should probably be a separate Python script with the filtering logic (not in export_tokens_job.py).
Could this perhaps be an individual function inside the ethereumetl/utils.py file? For example, a clean_user_provided_content(content) function that could be used by export_tokens_job.py (and other scripts) via from ethereumetl.utils import clean_user_provided_content.
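A minimal sketch of such a helper, assuming it would live in ethereumetl/utils.py. The function name follows the suggestion above; the exact set of characters to strip is an assumption — here it removes ASCII control characters (including ASCII 0, which BigQuery rejects) while keeping tab, newline, and carriage return:

```python
import re

# Hypothetical helper for ethereumetl/utils.py. The character set is an
# assumption: it strips ASCII control characters (NUL through US, plus DEL)
# but keeps tab (\x09), newline (\x0a), and carriage return (\x0d).
_CONTROL_CHARS_RE = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def clean_user_provided_content(content):
    """Remove control characters that BigQuery cannot load from a CSV field."""
    if content is None:
        return None
    return _CONTROL_CHARS_RE.sub('', str(content))
```

export_tokens_job.py could then call clean_user_provided_content(token.symbol) before writing the CSV row.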
Alternatively, it might be cleaner to use the pre-existing Unidecode library [1]. That way any .py file in the ETL application can clean up strings by importing Unidecode with from unidecode import unidecode and then calling clean_content = unidecode(str(dirty_content)) inline. The only catch is that we will need to add pip3 install Unidecode to the installer ;-)
[1] https://pypi.org/project/Unidecode/
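A small sketch of the Unidecode approach described above (the sample input string is made up). One caveat worth noting: as far as I can tell, Unidecode transliterates non-ASCII characters but passes ASCII bytes, including ASCII 0, through unchanged, so it may still need to be combined with a control-character filter for the BigQuery case:

```python
# Requires: pip3 install Unidecode
from unidecode import unidecode

# Hypothetical token metadata containing accented (non-ASCII) characters.
dirty_content = 'Tokén nàme'

# Transliterate to plain ASCII.
clean_content = unidecode(str(dirty_content))
print(clean_content)  # -> Token name
```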
Mister Medvedev, thanks a lot for the great solution (https://github.com/blockchain-etl/ethereum-etl/blob/develop/ethereumetl/jobs/export_tokens_job.py#L64). You saved my life. At least this evening.