test `read_gbq` with `dryRun` in `configuration` parameter
Hello,
A question recently came up on Stack Overflow about running a `read_gbq` job with `dryRun` set to `True`.
As it turns out, from what I could check, we can currently send query definitions, but everything defined outside of `query` is discarded.
I wonder if it would be possible to also honor other values, such as `dryRun`.
`kwargs` should probably be able to receive arguments such as `configuration={"query": {...}, "dryRun": True}`,
and `run_query` would probably have to process `job_config.update(config)`.
Best,
Will
Are there other properties besides dryRun that should be sent?
I think a general "update" call would be a good way to implement this. We'd probably want some checks for duplicate values; the current implementation checks that `query` is not also defined in the job config.
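A general update could be sketched as a plain dict merge over the REST-style configuration resource, with the duplicate-query check mentioned above. This is a hypothetical sketch (the function name and structure are illustrative, not pandas-gbq's actual implementation):

```python
def merge_job_config(base_config, user_config):
    """Merge a user-supplied ``configuration`` dict (REST API layout,
    e.g. {"query": {...}, "dryRun": True}) into a base job config.

    Hypothetical sketch -- not the actual pandas-gbq implementation.
    """
    merged = dict(base_config)
    for key, value in (user_config or {}).items():
        if key == "query":
            if "query" in value:
                # Guard against the SQL statement being supplied twice.
                raise ValueError(
                    "Query statement can't appear in both the query "
                    "argument and the configuration"
                )
            # Merge query settings without mutating the base config.
            merged["query"] = {**base_config.get("query", {}), **value}
        else:
            # Top-level settings such as dryRun pass straight through.
            merged[key] = value
    return merged
```

With this shape, `configuration={"dryRun": True}` would survive the merge instead of being discarded.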
I believe I might have fixed this issue in https://github.com/pydata/pandas-gbq/pull/152. I'll add a test that tries a `"dryRun": True` query before closing this issue.
Unfortunately, even with googleapis/python-bigquery-pandas#152, we still can't do dry-run queries, because an exception is raised when google-cloud-bigquery tries to fetch the results.
Test code (query from analyzing PyPI downloads):

```python
def test_configuration_with_dryrun(self):
    query = """SELECT COUNT(*) AS num_downloads
            FROM `the-psf.pypi.downloads*`
            WHERE file.project = 'pandas-gbq'
                -- Only query the last 30 days of history
                AND _TABLE_SUFFIX
                    BETWEEN FORMAT_DATE(
                        '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
                    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
            """
    config = {
        'dryRun': True
    }
    df = gbq.read_gbq(query, project_id=_get_project_id(),
                      private_key=self.credentials,
                      dialect='standard',
                      configuration=config)
    assert df is None
```
Exception:

```
pandas_gbq/tests/test_gbq.py:786:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas_gbq/gbq.py:812: in read_gbq
    schema, rows = connector.run_query(query, **kwargs)
pandas_gbq/gbq.py:534: in run_query
    self.process_http_error(ex)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ex = NotFound('GET https://www.googleapis.com/bigquery/v2/projects/swast-scratch/queries/5ce25ce1-444d-4353-b69e-8d096b392955?maxResults=0: Not found: Job swast-scratch:5ce25ce1-444d-4353-b69e-8d096b392955',)

    @staticmethod
    def process_http_error(ex):
        # See `BigQuery Troubleshooting Errors
        # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
>       raise GenericGBQException("Reason: {0}".format(ex))
E       pandas_gbq.gbq.GenericGBQException: Reason: 404 GET https://www.googleapis.com/bigquery/v2/projects/swast-scratch/queries/5ce25ce1-444d-4353-b69e-8d096b392955?maxResults=0: Not found: Job swast-scratch:5ce25ce1-444d-4353-b69e-8d096b392955

pandas_gbq/gbq.py:450: GenericGBQException
---------------------------- Captured stdout call ----------------------------
Requesting query... ok.
Job ID: 5ce25ce1-444d-4353-b69e-8d096b392955
Query running...
Query done.
Processed: 5.7 GB Billed: 0.0 B
Standard price: $0.00 USD

Retrieving results...
```
I think this might be related to issue https://github.com/pydata/pandas-gbq/issues/45 and/or an issue upstream in google-cloud-bigquery.
I've confirmed that dry-run queries do work upstream as of https://github.com/GoogleCloudPlatform/google-cloud-python/pull/5119. I think pandas-gbq will need to check for dry-run queries and decide not to try to fetch the results.
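The check could look roughly like this. `FakeQueryJob` is a stand-in used only to illustrate the control flow, since a real `google.cloud.bigquery` `QueryJob` needs credentials; the names and structure are illustrative, not pandas-gbq's actual code:

```python
class FakeQueryJob:
    """Stand-in for a google-cloud-bigquery QueryJob (illustration only)."""

    def __init__(self, dry_run, total_bytes_processed):
        self.dry_run = dry_run
        self.total_bytes_processed = total_bytes_processed

    def result(self):
        if self.dry_run:
            # Mirrors the failure seen above: a dry-run job stores no
            # result set, so trying to fetch it fails with a 404.
            raise RuntimeError("404 Not found: dry-run jobs have no results")
        return iter([])


def fetch_rows(query_job):
    """Skip the result fetch for dry-run jobs instead of 404ing."""
    if query_job.dry_run:
        # Dry runs only estimate cost; there are no rows to retrieve.
        print("Dry run: would process %d bytes"
              % query_job.total_bytes_processed)
        return None
    return list(query_job.result())
```

Returning `None` for dry runs would also make the `assert df is None` in the test code above pass.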
Any progress on this?