astro-sdk
Issue in `load_file` with some datasets in Snowflake
Version: astro==0.4.0
Problem

At the moment, we are unable to load the following dataset from the Tate Gallery into Snowflake: https://github.com/tategallery/collection/blob/master/artwork_data.csv. The operation works using BQ and Postgres. I could not find any particular issue with the original dataset.
Exception:
```text
  File "/home/tati/Code/astro-fresh/src/astro/sql/operators/agnostic_load_file.py", line 80, in execute
    move_dataframe_to_sql(
  File "/home/tati/Code/astro-fresh/src/astro/utils/load_dataframe.py", line 72, in move_dataframe_to_sql
    write_pandas(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/snowflake/connector/pandas_tools.py", line 146, in write_pandas
    create_stage_sql = (
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/util/_decorators.py", line 207, in wrapper
    return func(*args, **kwargs)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/core/frame.py", line 2677, in to_parquet
    return to_parquet(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/io/parquet.py", line 416, in to_parquet
    impl.write(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/io/parquet.py", line 173, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 1561, in pyarrow.lib.Table.from_pandas
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
    arrays = [convert_column(c, f)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
    raise e
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("Could not convert '99' with type str: tried to convert to double", 'Conversion failed for column HEIGHT with type object')
```
How to reproduce
Download the dataset artwork_data.csv.

Update the tests/benchmark/config.json file to include a dataset similar to:

```json
{
  "name": "few_mb",
  "size": "24M",
  "path": "<path-to>/artwork_data.csv",
  "rows": 69201
},
```

and a database entry that uses Snowflake.

From within the tests/benchmark folder, run:

```shell
./run.py --dataset=few_mb --database=snowflake
```
Initial analysis
The first step of `load_file` is to load the CSV into a Pandas data frame. In the case of this particular dataset, Pandas automagically assigns the following types per column:
```text
(Pdb) df.dtypes
id                      int64
accession_number       object
artist                 object
artistRole             object
artistId                int64
title                  object
dateText               object
medium                 object
creditLine             object
year                   object
acquisitionYear       float64
dimensions             object
width                  object
height                 object
depth                 float64
units                  object
inscription            object
thumbnailCopyright     object
thumbnailUrl           object
url                    object
dtype: object
```
When analyzing the values within height, it is possible to see that there is a mixture of strings, floats, and NaN:

```text
(Pdb) len([i for i in df.height if isinstance(i, str)])
31330
(Pdb) len([i for i in df.height if not isinstance(i, str)])
37871
```
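One way to see the effect of that mixture (and a possible manual workaround, sketched here, not something astro currently does) is to coerce the column to a numeric dtype before writing. `pd.to_numeric(errors="coerce")` turns numeric-looking strings into floats and anything unparseable into NaN:

```python
import pandas as pd

# Mimic the mixed column: a numeric string, a float, and a missing value.
height = pd.Series(["99", 12.5, None], dtype=object)
assert height.dtype == object  # Pandas keeps the mixture as `object`

# Coercing resolves the column to a single numeric dtype,
# which pyarrow can then map to a Parquet double column.
cleaned = pd.to_numeric(height, errors="coerce")
print(cleaned.dtype, cleaned.tolist())
```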
Why doesn't this happen for BQ & Postgres?
Because they are currently using a different strategy to write from the data frame into the table in the database: https://github.com/astro-projects/astro/blob/4e63302bc5c69401b10568598c4ff738e21563f5/src/astro/utils/load_dataframe.py#L60-L95
I believe that the solution is to add a feature where a user can manually define a schema using a list of SQLAlchemy columns. @utkarsharma2 had looked into this at some point.
This problem is caused by incorrect type identification by Pandas. We have addressed it by allowing users to specify the columns as part of the Table definition, introduced in 0.9:
https://github.com/astronomer/astro-sdk/blob/main/python-sdk/docs/CHANGELOG.md#090
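As an illustration of that fix, the schema can be pinned with SQLAlchemy column definitions instead of relying on Pandas inference. The exact astro-sdk `Table` signature is documented in the changelog linked above; the column types below are illustrative guesses for this dataset, and the commented `Table(...)` call is a hypothetical usage, not copied from the SDK:

```python
from sqlalchemy import Column, Float, Integer, String

# Explicit definitions for the problematic fields; forcing `height`
# to Float sidesteps Pandas inferring a mixed object column.
artwork_columns = [
    Column("id", Integer()),
    Column("artist", String(256)),
    Column("height", Float()),
    Column("width", Float()),
]

# Hypothetically passed to the SDK, e.g.:
# Table(name="artwork", conn_id="snowflake_default", columns=artwork_columns)
for col in artwork_columns:
    print(col.name, col.type)
```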