python-bigquery-pandas
ArrowTypeError: Expected a string or bytes dtype, got uint8 when running to_gbq with uint8
Environment details
- OS type and version: Ubuntu 20.04.3 LTS
- Python version: 3.7.12
- pip version: 22.3.1
- pandas-gbq version: 0.17.9
Steps to reproduce
- Create a dataframe that has a column of dtype uint8 (the default type that gets output by pandas.get_dummies, for example)
- Execute to_gbq on that dataframe and notice ArrowTypeError: Expected a string or bytes dtype, got uint8
Code example
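import pandas as pd
# FULL_BQ_NAME and GOOGLE_PROJECT are placeholders for the destination table and project ID
my_df = pd.DataFrame({'col': [0, 1]}, dtype="uint8")
my_df.to_gbq(FULL_BQ_NAME, project_id=GOOGLE_PROJECT, if_exists='replace')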
Stack trace
/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/_pandas_helpers.py in bq_to_arrow_array(series, bq_field)
288 if field_type_upper in schema._STRUCT_TYPES:
289 return pyarrow.StructArray.from_pandas(series, type=arrow_type)
--> 290 return pyarrow.Array.from_pandas(series, type=arrow_type)
291
292
/opt/conda/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.Array.from_pandas()
/opt/conda/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
/opt/conda/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowTypeError: Expected a string or bytes dtype, got uint8
Thanks for the report! Thankfully uint8 fits inside int64, so it seems we should be using BigQuery INT64 columns for these types.
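Until a fix lands, one possible workaround is to widen the unsigned columns yourself before uploading. This is only a sketch, not something the library does for you: `widen_unsigned_ints` is a hypothetical helper name, and it assumes plain NumPy-backed columns without missing values. It leaves uint64 alone, since values above 2**63 - 1 would not fit in a signed 64-bit integer.

```python
import pandas as pd


def widen_unsigned_ints(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with unsigned integer columns up to 32 bits cast to int64.

    Hypothetical helper for illustration; not part of pandas-gbq.
    """
    out = df.copy()
    for name in out.columns:
        dtype = out[name].dtype
        # dtype.kind == "u" marks unsigned integers (uint8, uint16, ...).
        # uint64 (itemsize 8) is skipped because it can overflow int64.
        if dtype.kind == "u" and dtype.itemsize < 8:
            out[name] = out[name].astype("int64")
    return out


my_df = pd.DataFrame({"col": [0, 1]}, dtype="uint8")
fixed = widen_unsigned_ints(my_df)
print(fixed.dtypes)
```

The result can then be passed to to_gbq in the same way as in the code example from the report, with `fixed` in place of `my_df`.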
I'm seeing a similar kind of issue: Expected bytes, got a 'int' object
@anujsh61 Can you confirm if you're creating a new table or writing to one that already exists?
I think the fix for this issue needs to happen here: https://github.com/googleapis/python-bigquery-pandas/blob/d9211069f3f744d75178c102757ea519185dbcff/pandas_gbq/schema.py#L108
I suspect uint8 is hitting our "string" fallback dtype.
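To illustrate how a fallback like that could be hit, here is a rough sketch. It assumes the inference is keyed on NumPy's `dtype.kind`; the mapping and the `guess_bq_type` name are made up for this example and are not copied from pandas_gbq/schema.py. The kind code for uint8 is "u", not "i", so a table that only lists "i" would send it to the fallback.

```python
import numpy as np

# Hypothetical mapping keyed on numpy dtype.kind; not the library's actual table.
KIND_TO_BQ_TYPE = {
    "i": "INTEGER",
    "u": "INTEGER",  # without this entry, uint8 would fall through to the fallback
    "b": "BOOLEAN",
    "f": "FLOAT",
    "M": "TIMESTAMP",
}


def guess_bq_type(dtype, fallback="STRING"):
    """Look up a BigQuery type name for a dtype, using the fallback if unknown."""
    return KIND_TO_BQ_TYPE.get(np.dtype(dtype).kind, fallback)


print(np.dtype("uint8").kind)   # kind code for unsigned integers
print(guess_bq_type("uint8"))
print(guess_bq_type(object))
```

Mapping "u" to INTEGER is safe for uint8, uint16 and uint32; uint64 would need separate handling because it can exceed the INT64 range.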
Aside: I see we are always hitting the "table already exists" case in the google-cloud-bigquery library. Now that we're using BQ Load jobs, I think we can try removing all of our type inference logic from this library, as well as the following logic, to solve this issue:
https://github.com/googleapis/python-bigquery-pandas/blob/d9211069f3f744d75178c102757ea519185dbcff/pandas_gbq/gbq.py#L1350-L1359
and
https://github.com/googleapis/python-bigquery-pandas/blob/d9211069f3f744d75178c102757ea519185dbcff/pandas_gbq/gbq.py#L1197-L1218