astro-sdk
Issue in `load_file` with some datasets in Snowflake
Version: astro==0.4.0
Problem

At the moment, we are unable to load the following dataset from the Tate Gallery into Snowflake: https://github.com/tategallery/collection/blob/master/artwork_data.csv. The operation works using BQ and Postgres. I could not find any particular issue with the original dataset.
Exception:
```text
  File "/home/tati/Code/astro-fresh/src/astro/sql/operators/agnostic_load_file.py", line 80, in execute
    move_dataframe_to_sql(
  File "/home/tati/Code/astro-fresh/src/astro/utils/load_dataframe.py", line 72, in move_dataframe_to_sql
    write_pandas(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/snowflake/connector/pandas_tools.py", line 146, in write_pandas
    create_stage_sql = (
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/util/_decorators.py", line 207, in wrapper
    return func(*args, **kwargs)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/core/frame.py", line 2677, in to_parquet
    return to_parquet(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/io/parquet.py", line 416, in to_parquet
    impl.write(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/io/parquet.py", line 173, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 1561, in pyarrow.lib.Table.from_pandas
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
    arrays = [convert_column(c, f)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
    raise e
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("Could not convert '99' with type str: tried to convert to double", 'Conversion failed for column HEIGHT with type object')
```
How to reproduce
Download the dataset artwork_data.csv.

Update the tests/benchmark/config.json file to include a dataset similar to:

```json
{
  "name": "few_mb",
  "size": "24M",
  "path": "<path-to>/artwork_data.csv",
  "rows": 69201
},
```

and a database entry that uses Snowflake.

From within the tests/benchmark folder, run:

```shell
./run.py --dataset=few_mb --database=snowflake
```
Initial analysis
The first step of `load_file` is to load the CSV into a Pandas data frame. In the case of this particular dataset, Pandas automagically assigns the following types per column:
```text
(Pdb) df.dtypes
id                      int64
accession_number       object
artist                 object
artistRole             object
artistId                int64
title                  object
dateText               object
medium                 object
creditLine             object
year                   object
acquisitionYear       float64
dimensions             object
width                  object
height                 object
depth                 float64
units                  object
inscription            object
thumbnailCopyright     object
thumbnailUrl           object
url                    object
dtype: object
```
When analyzing the values within height, it is possible to see that there is a mixture of strings, floats, and NaN:

```text
(Pdb) len([i for i in df.height if isinstance(i, str)])
31330
(Pdb) len([i for i in df.height if not isinstance(i, str)])
37871
```
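One way to see the effect of that mixture (and a possible manual workaround, sketched here, not something astro currently does) is to coerce the column to a numeric dtype before writing. `pd.to_numeric(errors="coerce")` turns numeric-looking strings into floats and anything unparseable into NaN:

```python
import pandas as pd

# Mimic the mixed column: a numeric string, a float, and a missing value.
height = pd.Series(["99", 12.5, None], dtype=object)
assert height.dtype == object  # Pandas keeps the mixture as `object`

# Coercing resolves the column to a single numeric dtype,
# which pyarrow can then map to a Parquet double column.
cleaned = pd.to_numeric(height, errors="coerce")
print(cleaned.dtype, cleaned.tolist())
```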
Why doesn't this happen for BQ & Postgres?
Because they are currently using a different strategy to write from the data frame into the table in the database: https://github.com/astro-projects/astro/blob/4e63302bc5c69401b10568598c4ff738e21563f5/src/astro/utils/load_dataframe.py#L60-L95
I believe that the solution is to add a feature where a user can manually define a schema using a list of SQLAlchemy columns. @utkarsharma2 had looked into this at some point.
This problem is caused by incorrect type identification by Pandas. We have addressed it by allowing users to specify the columns as part of the Table definition, introduced in 0.9:
https://github.com/astronomer/astro-sdk/blob/main/python-sdk/docs/CHANGELOG.md#090
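As an illustration of that fix, the schema can be pinned with SQLAlchemy column definitions instead of relying on Pandas inference. The exact astro-sdk `Table` signature is documented in the changelog linked above; the column types below are illustrative guesses for this dataset, and the commented `Table(...)` call is a hypothetical usage, not copied from the SDK:

```python
from sqlalchemy import Column, Float, Integer, String

# Explicit definitions for the problematic fields; forcing `height`
# to Float sidesteps Pandas inferring a mixed object column.
artwork_columns = [
    Column("id", Integer()),
    Column("artist", String(256)),
    Column("height", Float()),
    Column("width", Float()),
]

# Hypothetically passed to the SDK, e.g.:
# Table(name="artwork", conn_id="snowflake_default", columns=artwork_columns)
for col in artwork_columns:
    print(col.name, col.type)
```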