python-bigquery-pandas
to_gbq() failed to stream record type to BigQuery
Looks like there's some issue when streaming RECORD type data to BigQuery. Here are two examples of how to reproduce it. Any help is appreciated!
Without specifying the table_schema arg, there are no errors, but column b is loaded as STRING type.
import pandas as pd
from pandas_gbq import gbq

df = pd.DataFrame([[0, {'test': 1}]], columns=['a', 'b'])
gbq.to_gbq(df, destination_table='temp.record_type_1', project_id=[PROJECT_NAME])
Adding the table_schema arg returns a 400 error, because the RECORD column is still serialized as strings.
df = pd.DataFrame([[0, {'test': 1}]], columns=['a', 'b'])
table_schema = [
    {"type": "INTEGER", "mode": "NULLABLE", "name": "a"},
    {"type": "RECORD", "mode": "NULLABLE", "name": "b",
     "fields": [{"type": "INTEGER", "mode": "NULLABLE", "name": "test"}]},
]
gbq.to_gbq(df, destination_table='temp.record_type_2', project_id=[PROJECT_NAME], table_schema=table_schema)
This returns:
GenericGBQException: Reason: 400 POST https://www.googleapis.com/upload/bigquery/v2/projects/[PROJECT_NAME]/jobs?uploadType=resumable: Provided Schema does not match Table [PROJECT_NAME]:temp.record_type_2. Field b has changed type from RECORD to STRING
Package versions:
pandas==0.22.0
pandas-gbq==0.3.1
Thanks for the report.
Representing nested data structures isn't idiomatic in pandas, and writing them to GBQ isn't currently supported.
One solution is to load the data flattened to BQ and then run a BQ query to nest it, as in the sketch below.
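A minimal sketch of that flatten-then-nest route, assuming a recent pandas (for pd.json_normalize) and a hypothetical flat staging table:

import pandas as pd
from pandas_gbq import gbq

# Flatten the RECORD column into scalar columns that to_gbq can handle.
df = pd.DataFrame([[0, {'test': 1}]], columns=['a', 'b'])
flat = pd.concat([df[['a']], pd.json_normalize(df['b'].tolist()).add_prefix('b_')], axis=1)
# flat now has scalar columns a and b_test.
gbq.to_gbq(flat, destination_table='temp.record_type_flat', project_id=[PROJECT_NAME])

# Then re-nest inside BigQuery with standard SQL, e.g.:
#   CREATE OR REPLACE TABLE temp.record_type_nested AS
#   SELECT a, STRUCT(b_test AS test) AS b
#   FROM temp.record_type_flat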
We can leave this open for tracking
Thanks @maxim-lian. Running a BQ query might be a workaround for adhoc jobs. But I think I am going to serialize the data in JSON, load it to Google Cloud Storage and then copy over to BQ with the right schema.
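For anyone following the same route, a rough sketch using the google-cloud-bigquery client; the bucket, file, and destination table names here are hypothetical:

import pandas as pd
from google.cloud import bigquery

# Serialize the DataFrame as newline-delimited JSON; upload the file to GCS
# (e.g. with gsutil cp) before running the load job below.
df = pd.DataFrame([[0, {'test': 1}]], columns=['a', 'b'])
df.to_json('records.json', orient='records', lines=True)

client = bigquery.Client(project='[PROJECT_NAME]')  # placeholder project
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[
        bigquery.SchemaField('a', 'INTEGER', mode='NULLABLE'),
        bigquery.SchemaField('b', 'RECORD', mode='NULLABLE', fields=[
            bigquery.SchemaField('test', 'INTEGER', mode='NULLABLE'),
        ]),
    ],
)
load_job = client.load_table_from_uri(
    'gs://[BUCKET]/records.json',  # hypothetical GCS path
    'temp.record_type_3',          # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load finishes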
@LiChangNY Yes, that's a great solution.
Hi, any chance RECORD fields are supported in pandas_gbq now?
@hilahersz — pandas really isn't the best tool for nested data structures. I'm less involved in this library these days, but I don't think it's implemented yet.
Closing this out as a "Won't Fix".