Slow read_gbq compared to QueryJob.to_dataframe
There is a substantial difference in execution time between the pandas_gbq.read_gbq and google.cloud.bigquery.job.query.QueryJob.to_dataframe functions.
The reason is that pandas_gbq.read_gbq calls rows_iter.to_dataframe here with dtypes=conversion_dtypes, which triggers extra column conversions in RowIterator.to_dataframe.
conversion_dtypes is created this way:
conversion_dtypes = _bqschema_to_nullsafe_dtypes(schema_fields)
Why is this the default behavior? In our case we don't need these costly column conversions.
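For reference, a minimal timing sketch of the two code paths we compared (the query, project id, and timing harness are placeholders, not taken from our actual workload):

```python
import time

import pandas_gbq
from google.cloud import bigquery

# Placeholder query and project, only for illustration.
QUERY = "SELECT * FROM `my-project.my_dataset.my_table`"
PROJECT = "my-project"

# Path 1: pandas_gbq.read_gbq, which applies the default dtype mapping.
start = time.perf_counter()
df_gbq = pandas_gbq.read_gbq(QUERY, project_id=PROJECT)
print(f"pandas_gbq.read_gbq:   {time.perf_counter() - start:.1f} s")

# Path 2: QueryJob.to_dataframe, which skips that extra per-column conversion.
client = bigquery.Client(project=PROJECT)
start = time.perf_counter()
df_job = client.query(QUERY).to_dataframe()
print(f"QueryJob.to_dataframe: {time.perf_counter() - start:.1f} s")
```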
Profiling results
117.538 RowIterator.to_dataframe google/cloud/bigquery/table.py:2261
65.551 DataFrame.__setitem__ pandas/core/frame.py:3630
48.537 RowIterator.to_arrow google/cloud/bigquery/table.py:2058
2.302 table_to_dataframe
Column transformations take 55% of RowIterator.to_dataframe's execution time.
Environment details
- Python version: 3.10.12
- pandas-gbq version: 0.28.0
Hi @tulinski, thanks for raising the issue. The default mapper is necessary because some BigQuery types don't map to the types that pandas defaults to. You can add to the column types by passing a dictionary as the dtypes parameter, but indeed there isn't a way to remove the existing mappers.
We could offer a flag such as overwrite_default_types but I think it might become really confusing to general users.
A pretty hacky way, though, is to pass dtypes={"FLOAT": None, "INTEGER": None, "TIME": None, "BOOLEAN": None} to overwrite them, because we use dictionary.update() to combine the user-provided dict with the default one. I think it might work as a temporary solution; see the sketch below.
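A sketch of that workaround (the query and project id are placeholders; the effect of None values depends on the internal dictionary.update() behavior described above and may change in a future release):

```python
import pandas_gbq

# Suggested workaround: map the BigQuery type names to None so that
# dict.update() overwrites the default mappers. This relies on internal
# behavior and is not a supported API guarantee.
df = pandas_gbq.read_gbq(
    "SELECT * FROM `my-project.my_dataset.my_table`",
    project_id="my-project",
    dtypes={"FLOAT": None, "INTEGER": None, "TIME": None, "BOOLEAN": None},
)
```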
Hi @Linchin, thank you for your response.
Could you please give an example of a query where the default mapper makes a difference between the default behavior of pandas_gbq.read_gbq and google.cloud.bigquery.job.query.QueryJob.to_dataframe?
We've experimented with the FLOAT, INTEGER, TIME, and BOOLEAN BigQuery data types (including NULL values), and there was no difference between the DataFrames produced by pandas_gbq.read_gbq and QueryJob.to_dataframe.
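Roughly the kind of check we ran (the project id and the exact query are placeholders; assert_frame_equal from pandas is one way to compare the two results):

```python
import pandas_gbq
from google.cloud import bigquery
from pandas.testing import assert_frame_equal

# Placeholder project; the query exercises FLOAT, INTEGER, TIME and BOOLEAN,
# including NULL values.
PROJECT = "my-project"
QUERY = """
    SELECT 1.5 AS f, 42 AS i, TIME '12:30:00' AS t, TRUE AS b
    UNION ALL
    SELECT CAST(NULL AS FLOAT64), CAST(NULL AS INT64),
           CAST(NULL AS TIME), CAST(NULL AS BOOL)
"""

df_gbq = pandas_gbq.read_gbq(QUERY, project_id=PROJECT)
df_job = bigquery.Client(project=PROJECT).query(QUERY).to_dataframe()

print(df_gbq.dtypes)
print(df_job.dtypes)
assert_frame_equal(df_gbq, df_job)  # raises only if the two DataFrames differ
```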