python-bigquery-pandas

ArrowTypeError: Expected a string or bytes dtype, got uint8 when running to_gbq with uint8

Open wnojopra opened this issue 2 years ago • 4 comments

Environment details

  • OS type and version: Ubuntu 20.04.3 LTS
  • Python version: 3.7.12
  • pip version: 22.3.1
  • pandas-gbq version: 0.17.9

Steps to reproduce

  1. Create a dataframe that has a column of dtype uint8 (the default type that gets output by pandas.get_dummies, for example)
  2. Execute to_gbq on that dataframe and notice ArrowTypeError: Expected a string or bytes dtype, got uint8

Code example

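import pandas as pd  # DataFrame.to_gbq also requires the pandas-gbq package

# FULL_BQ_NAME and GOOGLE_PROJECT are placeholders for the destination table and project ID.
my_df = pd.DataFrame({'col': [0, 1]}, dtype="uint8")
my_df.to_gbq(FULL_BQ_NAME, project_id=GOOGLE_PROJECT, if_exists='replace')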

Stack trace

/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/_pandas_helpers.py in bq_to_arrow_array(series, bq_field)
    288     if field_type_upper in schema._STRUCT_TYPES:
    289         return pyarrow.StructArray.from_pandas(series, type=arrow_type)
--> 290     return pyarrow.Array.from_pandas(series, type=arrow_type)
    291 
    292 

/opt/conda/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.Array.from_pandas()

/opt/conda/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: Expected a string or bytes dtype, got uint8

wnojopra, Mar 02 '23 19:03

Thanks for the report! Thankfully uint8 fits inside int64, so it seems we should be using BigQuery INT64 columns for these types.
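Until that is handled in the library, one possible workaround is to widen the unsigned columns yourself before uploading. The following is only a sketch: `widen_unsigned_ints` is a hypothetical helper (not part of pandas-gbq), it assumes NumPy-backed columns with unique names, and it deliberately skips uint64 because its largest values do not fit inside int64.

```python
import numpy as np
import pandas as pd


def widen_unsigned_ints(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with uint8/uint16/uint32 columns cast to int64.

    Hypothetical helper for illustration; not part of pandas-gbq.
    """
    out = df.copy()
    for name in out.columns:
        dtype = out[name].dtype
        # Only touch plain NumPy dtypes. Kind "u" marks unsigned integers,
        # and itemsize < 8 excludes uint64, which can overflow int64.
        if isinstance(dtype, np.dtype) and dtype.kind == "u" and dtype.itemsize < 8:
            out[name] = out[name].astype("int64")
    return out


my_df = pd.DataFrame({"col": [0, 1]}, dtype="uint8")
safe_df = widen_unsigned_ints(my_df)
print(safe_df.dtypes)
```

The result can then be passed to to_gbq in the same way as in the original report. Whether this avoids the error in every setup has not been confirmed in this thread.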

tswast, Mar 28 '23 19:03

I'm seeing a similar kind of issue: Expected bytes, got a 'int' object

anujsh61, Aug 28 '23 19:08

@anujsh61 Can you confirm if you're creating a new table or writing to one that already exists?

tswast, Nov 20 '23 14:11

I think the fix for this issue needs to happen here: https://github.com/googleapis/python-bigquery-pandas/blob/d9211069f3f744d75178c102757ea519185dbcff/pandas_gbq/schema.py#L108

I suspect uint8 is hitting our "string" fallback dtype.
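To illustrate how such a fallback could be reached: NumPy reports kind "u" for unsigned integers, which is distinct from the "i" used for signed ones, so a lookup keyed only on "i" would miss uint8. The mapping and the `infer_bq_type` function below are assumptions made up for this example, not a copy of the code in pandas_gbq/schema.py.

```python
import numpy as np

# Illustrative kind-to-type table (an assumption for this sketch, not the
# library's actual mapping). Note there is no entry for kind "u".
KIND_TO_BQ_TYPE = {
    "i": "INTEGER",
    "b": "BOOLEAN",
    "f": "FLOAT",
    "O": "STRING",
    "M": "TIMESTAMP",
}


def infer_bq_type(dtype, mapping=KIND_TO_BQ_TYPE, default_type="STRING"):
    """Look up a BigQuery type by NumPy dtype kind, falling back to default_type."""
    return mapping.get(np.dtype(dtype).kind, default_type)


print(np.dtype("uint8").kind)      # kind code for unsigned integers
print(infer_bq_type("uint8"))      # no "u" entry, so the fallback is used

# Adding a "u" entry makes unsigned integers resolve to an integer column.
patched = {**KIND_TO_BQ_TYPE, "u": "INTEGER"}
print(infer_bq_type("uint8", patched))
```

If the real schema code behaves along these lines, adding an entry for kind "u" would be one way to stop uint8 columns from being treated as strings.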

Aside: I see we are always hitting the "table already exists" case in the google-cloud-bigquery library. Now that we're using BQ Load jobs, I think we can try removing all of our type inference logic from this library, as well as the following logic, to solve this issue:

https://github.com/googleapis/python-bigquery-pandas/blob/d9211069f3f744d75178c102757ea519185dbcff/pandas_gbq/gbq.py#L1350-L1359

and

https://github.com/googleapis/python-bigquery-pandas/blob/d9211069f3f744d75178c102757ea519185dbcff/pandas_gbq/gbq.py#L1197-L1218

tswast, Nov 20 '23 14:11