ethereum-etl
Filter out ASCII characters not supported by BigQuery
BigQuery fails when trying to load a CSV with ASCII 0 with the following message:
Error: Bad character (ASCII 0) encountered.
We need to check what other characters are not supported in BigQuery and filter them out (see https://en.wikipedia.org/wiki/ASCII).
https://github.com/medvedev1088/ethereum-etl/blob/master/ethereumetl/jobs/export_tokens_job.py#L64
This should probably be a separate Python script with the filtering logic (not in export_tokens_job.py).
Could this perhaps be an individual function inside the ethereumetl/utils.py file? For example, a clean_user_provided_content(content) function that could be used by export_tokens_job.py (and other scripts) via from ethereumetl.utils import clean_user_provided_content.
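A minimal sketch of such a helper, assuming it would live in ethereumetl/utils.py. The function name follows the suggestion above; the exact set of characters to strip is an assumption — here it removes ASCII control characters (including ASCII 0, which BigQuery rejects) while keeping tab, newline, and carriage return:

```python
import re

# Hypothetical helper for ethereumetl/utils.py. The character set is an
# assumption: it strips ASCII control characters (NUL through US, plus DEL)
# but keeps tab (\x09), newline (\x0a), and carriage return (\x0d).
_CONTROL_CHARS_RE = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def clean_user_provided_content(content):
    """Remove control characters that BigQuery cannot load from a CSV field."""
    if content is None:
        return None
    return _CONTROL_CHARS_RE.sub('', str(content))
```

export_tokens_job.py could then call clean_user_provided_content(token.symbol) before writing the CSV row.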
Alternatively, it might be cleaner to use the pre-existing Unidecode library [1]. That way any .py file in the ETL application can clean up strings by importing Unidecode with from unidecode import unidecode and then calling clean_content = unidecode(str(dirty_content)) inline. The only catch is that we will need to add pip3 install Unidecode to the installer ;-)
[1] https://pypi.org/project/Unidecode/
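A small sketch of the Unidecode approach described above (the sample input string is made up). One caveat worth noting: as far as I can tell, Unidecode transliterates non-ASCII characters but passes ASCII bytes, including ASCII 0, through unchanged, so it may still need to be combined with a control-character filter for the BigQuery case:

```python
# Requires: pip3 install Unidecode
from unidecode import unidecode

# Hypothetical token metadata containing accented (non-ASCII) characters.
dirty_content = 'Tokén nàme'

# Transliterate to plain ASCII.
clean_content = unidecode(str(dirty_content))
print(clean_content)  # -> Token name
```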
Mister Medvedev, thanks a lot for the great solution (https://github.com/blockchain-etl/ethereum-etl/blob/develop/ethereumetl/jobs/export_tokens_job.py#L64). You saved my life. At least this evening.