python-bigquery-pandas
Better error message for incorrect to_gbq() table_schema
I am using pandas with BigQuery tables. Because pandas doesn't support missing values in boolean columns, BigQuery BOOLEAN columns become object columns in the dataframe. I have to use the to_gbq() table_schema argument to force those columns back to BOOLEAN.
My dataframe has 104 columns. table_schema has to include every column (until this issue is fixed).
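For illustration, here is a minimal sketch of that pattern. The column names, dataset, table, and project ID are all made up; table_schema uses the list-of-dicts format that to_gbq() accepts.

```python
import pandas as pd

# A BOOLEAN column containing NULLs comes back from BigQuery as object
# dtype, so without table_schema the type would be guessed incorrectly
# on the way back in.
df = pd.DataFrame({
    "is_active": [True, None, False],
    "score": [1.0, 2.0, 3.0],
})

# table_schema must describe every column, not just the ones whose
# types are being overridden.
table_schema = [
    {"name": "is_active", "type": "BOOLEAN"},
    {"name": "score", "type": "FLOAT"},
]

df.to_gbq(
    "my_dataset.my_table",    # hypothetical destination table
    project_id="my-project",  # hypothetical project ID
    table_schema=table_schema,
    if_exists="replace",
)
```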
If something in table_schema is wrong, I get the error:
GenericGBQException                       Traceback (most recent call last)
<ipython-input-174-5109529cfab8> in <module>()
117 project_id='bryancrampton-testing',
118 table_schema=table_schema,
--> 119 if_exists='replace')
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in to_gbq(self, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver, table_schema)
1185 verbose=verbose, reauth=reauth, if_exists=if_exists,
1186 private_key=private_key, auth_local_webserver=auth_local_webserver,
-> 1187 table_schema=table_schema)
1188
1189 @classmethod
/usr/local/lib/python2.7/dist-packages/pandas/io/gbq.pyc in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver, table_schema)
117 verbose=verbose, reauth=reauth, if_exists=if_exists,
118 private_key=private_key, auth_local_webserver=auth_local_webserver,
--> 119 table_schema=table_schema)
/usr/local/lib/python2.7/dist-packages/pandas_gbq/gbq.pyc in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver, table_schema)
979 connector.load_data(
980 dataframe, dataset_id, table_id, chunksize=chunksize,
--> 981 schema=table_schema)
982
983
/usr/local/lib/python2.7/dist-packages/pandas_gbq/gbq.pyc in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
565 ((total_rows - remaining_rows) * 100) / total_rows))
566 except self.http_error as ex:
--> 567 self.process_http_error(ex)
568
569 logger.info("\n")
/usr/local/lib/python2.7/dist-packages/pandas_gbq/gbq.pyc in process_http_error(ex)
454 # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
455
--> 456 raise GenericGBQException("Reason: {0}".format(ex))
457
458 def run_query(self, query, **kwargs):
The error message doesn't say which part of table_schema is wrong. I have to double-check all 104 columns.
I went to the BigQuery UI and looked at the error for the last job. There was another cryptic message:
Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the error stream for more details. (error code: invalid)
Error while reading data, error message: CSV table references column position 103, but line starting at position:0 contains only 48 columns. (error code: invalid)
It turns out table_schema had 104 fields, while the dataframe being written had only 48 columns.
So one suggestion is:
- to_gbq() could check whether table_schema has the same number of columns as the dataframe and, if not, raise an exception rather than trying to write table_schema to BigQuery; a rough sketch of such a check follows below. (Of course, this suggestion is no longer valid once #218 is fixed.)
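To make the suggestion concrete, here is a minimal sketch of the kind of pre-flight check I mean. The helper name validate_table_schema is made up and not part of pandas-gbq:

```python
import pandas as pd


def validate_table_schema(dataframe, table_schema):
    """Raise a descriptive error if table_schema doesn't match the dataframe.

    Hypothetical helper, not part of pandas-gbq. table_schema is the same
    list-of-dicts format that to_gbq() accepts.
    """
    schema_names = [field["name"] for field in table_schema]
    frame_names = list(dataframe.columns)

    if len(schema_names) != len(frame_names):
        raise ValueError(
            "table_schema has {} fields but the dataframe has {} columns".format(
                len(schema_names), len(frame_names)
            )
        )

    # Name the specific mismatches instead of failing with a generic error.
    missing = set(frame_names) - set(schema_names)
    extra = set(schema_names) - set(frame_names)
    if missing or extra:
        raise ValueError(
            "table_schema mismatch: missing fields for columns {}, "
            "extra fields {}".format(sorted(missing), sorted(extra))
        )
```

A check like this would have reported the 104-vs-48 mismatch directly, instead of surfacing it as a generic load error.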
More detailed error messages would be much appreciated.
#218 is fixed, so most of this issue should be resolved. There has also been a broad effort to improve error messages across the BigQuery API and related tools, so it's not clear whether this issue persists in other forms. If you have a new example of a confusing error message, please file a new issue. Thank you!