google-cloud-python icon indicating copy to clipboard operation
google-cloud-python copied to clipboard

Slow read_gbq comparing to QueryJob.to_dataframe

Open tulinski-rtbh opened this issue 7 months ago • 3 comments

There is a substantial difference in execution time between pandas_gbq.read_gbq and google.cloud.bigquery.job.query.QueryJob.to_dataframe functions.

The reason is pandas_gbq.read_gbq calls rows_iter.to_dataframe here with dtypes=conversion_dtypes which causes converting columns in RowIterator.to_dataframe.

conversion_dtypes is created this way: conversion_dtypes = _bqschema_to_nullsafe_dtypes(schema_fields)

Why this is default behavior? In our case we don't need these costly column conversions.

Profiling results

117.538    RowIterator.to_dataframe        google/cloud/bigquery/table.py:2261
    65.551      DataFrame.__setitem__      pandas/core/frame.py:3630
    48.537      RowIterator.to_arrow       google/cloud/bigquery/table.py:2058
    2.302       table_to_dataframe

Column transformations take 55% time of RowIterator.to_dataframe.

Environment details

  • Python version: 3.10.12
  • pandas-gbq version: 0.28.0

tulinski-rtbh avatar May 08 '25 12:05 tulinski-rtbh

Hi @tulinski, thanks for raising the issue. The default mapper is necessary because some BigQuery types don't map to the types that pandas defaults to. Now you can add to the column types through passing a dictionary as the parameter dtypes, but indeed there isn't a way to remove the existing mappers.

We could offer a flag such as overwrite_default_types but I think it might become really confusing to general users.

A pretty hacky way, though, is to pass a dtypes={"FLOAT": None, "INTEGER": None, "TIME": None, "BOOLEAN": None} to overwrite it, because we use dictionary.update() to combine the user provided dict with the default one. I think it might work as a temporary solution.

Linchin avatar May 13 '25 00:05 Linchin

Hi @Linchin, thank you for your response.

Could you please give an example of a query where the default mapper make difference between pandas_gbq.read_gbq and google.cloud.bigquery.job.query.QueryJob.to_dataframe default behavior?

We've experimented with FLOAT, INTEGER, TIME, BOOLEAN BigQuery data types (including NULL values) and there was no difference in DataFrames produced by pandas_gbq.read_gbq and QueryJob.to_dataframe.

tulinski-rtbh avatar May 13 '25 11:05 tulinski-rtbh

Hhh

Leez666 avatar Oct 08 '25 06:10 Leez666