
test `read_gbq` with `dryRun` in `configuration` parameter

Open WillianFuks opened this issue 8 years ago • 6 comments

Hello,

Recently a question was posted on SO asking about running a `read_gbq` job with `dryRun` set to True.

As it turns out, from what I could check, we can currently send `query` definitions, but everything defined outside of `query` is discarded.

I wonder if it would be possible to also support other values such as `dryRun`.

`kwargs` should probably be able to receive arguments such as `configuration={"query": {...}, "dryRun": True}`,

and `run_query` would probably have to apply something like `job_config.update(config)`.
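
For illustration, a rough sketch of how that merge could look (the names here are only illustrative, not the actual pandas-gbq internals):

    # Rough sketch only: `config` is the user-supplied `configuration` dict and
    # `job_config` is the dict-shaped job configuration built internally.
    def merge_job_config(job_config, config):
        for key, value in config.items():
            if key == "query" and isinstance(value, dict):
                # Merge nested query settings instead of overwriting them.
                job_config.setdefault("query", {}).update(value)
            else:
                # Pass top-level settings such as dryRun straight through.
                job_config[key] = value
        return job_config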

Best,

Will

WillianFuks avatar Sep 22 '17 17:09 WillianFuks

Are there other properties besides `dryRun` that should be sent?

tswast avatar Dec 08 '17 17:12 tswast

I think a general “update” call would be a good way to implement this. We’d probably want some checks for duplicate values. I think the current implementation checks that query is not also defined in the job config.
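
For example, a hypothetical sketch of that duplicate check (not the actual pandas-gbq code; `config` is the user-supplied `configuration` dict and `query` is the SQL string passed to `read_gbq`):

    def validate_config(config, query):
        # Hypothetical guard: refuse a query statement supplied both as the
        # positional argument and inside configuration["query"]["query"].
        if query and config.get("query", {}).get("query"):
            raise ValueError(
                "Query statement can't appear in both the `query` argument "
                "and `configuration['query']['query']`"
            )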

tswast avatar Feb 12 '18 04:02 tswast

I believe I might have fixed this issue in https://github.com/pydata/pandas-gbq/pull/152. I'll add a test to try out a "dryRun": True query before closing this issue.

tswast avatar Mar 22 '18 23:03 tswast

Unfortunately, even with googleapis/python-bigquery-pandas#152, dry run queries still can't be run, because an exception is raised when google-cloud-bigquery tries to fetch the results.

Test code (query from analyzing PyPI downloads):

    def test_configuration_with_dryrun(self):
        query = """SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pandas-gbq'
  -- Only query the last 30 days of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
"""
        config = {
            'dryRun': True
        }
        df = gbq.read_gbq(query, project_id=_get_project_id(),
                          private_key=self.credentials,
                          dialect='standard',
                          configuration=config)
        assert df is None

Exception:

pandas_gbq/tests/test_gbq.py:786:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas_gbq/gbq.py:812: in read_gbq
    schema, rows = connector.run_query(query, **kwargs)
pandas_gbq/gbq.py:534: in run_query
    self.process_http_error(ex)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

ex = NotFound('GET https://www.googleapis.com/bigquery/v2/projects/swast-scratch/queries/5ce25ce1-444d-4353-b69e-8d096b392955?maxResults=0: Not found: Job swast-scratch:5ce25ce1-444d-4353-b69e-8d096b392955',)

    @staticmethod
    def process_http_error(ex):
        # See `BigQuery Troubleshooting Errors
        # <https://cloud.google.com/bigquery/troubleshooting-errors>`__

>       raise GenericGBQException("Reason: {0}".format(ex))
E       pandas_gbq.gbq.GenericGBQException: Reason: 404 GET https://www.googleapis.com/bigquery/v2/projects/swast-scratch/queries/5ce25ce1-444d-4353-b69e-8d096b392955?maxResults=0: Not found: Job swast-scratch:5ce25ce1-444d-4353-b69e-8d096b392955

pandas_gbq/gbq.py:450: GenericGBQException
------------------------------------------------------------------------------ Captured stdout call -------------------------------------------------------------------------------
Requesting query... ok.
Job ID: 5ce25ce1-444d-4353-b69e-8d096b392955
Query running...
Query done.
Processed: 5.7 GB Billed: 0.0 B
Standard price: $0.00 USD

Retrieving results...
=======================

I think this might be related to issue https://github.com/pydata/pandas-gbq/issues/45 and/or an issue upstream in google-cloud-bigquery.

tswast avatar Mar 22 '18 23:03 tswast

I've confirmed that dryRun queries do work upstream in https://github.com/GoogleCloudPlatform/google-cloud-python/pull/5119. I think pandas-gbq will need to check for dry run queries and decide not to try to fetch the results.
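
For reference, a minimal sketch of that kind of check against the google-cloud-bigquery client (just a sketch of the idea, not the pandas-gbq implementation):

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig()
    job_config.dry_run = True

    query_job = client.query("SELECT 1", job_config=job_config)

    if job_config.dry_run:
        # A dry run never produces a result set; only job statistics such as
        # total_bytes_processed are populated, so don't call result().
        print("Would process {} bytes".format(query_job.total_bytes_processed))
    else:
        rows = list(query_job.result())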

tswast avatar Mar 26 '18 23:03 tswast

any progress on this?

ramicaza avatar Sep 25 '22 07:09 ramicaza