perf: defer query in `read_gbq` with wildcard tables
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)
Fixes internal issue 405773140 🦕
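For context, a sketch of the behavior this PR targets (hypothetical usage; the exact deferral mechanics are internal to the change): reading a wildcard table should not eagerly execute a query, and the `_TABLE_SUFFIX` pseudocolumn should stay available for filtering.

```python
import bigframes.pandas as bpd

# Wildcard table read; with this change the scan should be deferred
# rather than executed eagerly at read_gbq time.
df = bpd.read_gbq("bigquery-public-data.noaa_gsod.gsod*")

# Assumed behavior from this PR's pseudocolumn work: _TABLE_SUFFIX is
# exposed as a filterable column.
df = df[df["_TABLE_SUFFIX"] >= "1990"]
```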
Failures look like real failures:

```
    if not 200 <= response.status_code < 300:
>       raise exceptions.from_http_response(response)
E       google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/python-docs-samples-tests/queries/32f42306-e95f-48bc-a2fb-56761aec5476?maxResults=0&location=US&prettyPrint=false: Invalid field name "_TABLE_SUFFIX". Field names are not allowed to start with the (case-insensitive) prefixes _PARTITION, _TABLE_, _FILE_, _ROW_TIMESTAMP, __ROOT__ and _COLIDENTIFIER
E
E       Location: US
E       Job ID: 32f42306-e95f-48bc-a2fb-56761aec5476
```
While we can query such fields, it looks like we can't materialize them.
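One workaround is to alias the pseudocolumn before results land in a destination table, since the reserved-prefix restriction applies to output field names rather than to the query itself. A minimal sketch with the plain BigQuery client (illustrative destination name; not necessarily what this PR does):

```python
from google.cloud import bigquery

client = bigquery.Client()

# `SELECT *` on a wildcard table excludes _TABLE_SUFFIX, so selecting it
# explicitly under a non-reserved alias lets the result be written to a
# destination table without tripping the "Invalid field name" error.
sql = """
    SELECT *, _TABLE_SUFFIX AS table_suffix
    FROM `bigquery-public-data.noaa_gsod.gsod*`
"""
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.materialized_wildcard",  # hypothetical
)
client.query(sql, job_config=job_config).result()
```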
Added "do not merge". Need to make sure this is compatible with `to_gbq()` and `cached()`.
From the notebook tests:
```
File /tmpfs/src/github/python-bigquery-dataframes/bigframes/core/nodes.py:711, in GbqTable.from_table(table, columns)
    708 @staticmethod
    709 def from_table(table: bq.Table, columns: Sequence[str] = ()) -> GbqTable:
    710     # Subsetting fields with columns can reduce cost of row-hash default ordering
--> 711     table_schema = bigframes.core.tools.bigquery.get_schema_and_pseudocolumns(table)
    713     if columns:
    714         schema = tuple(item for item in table_schema if item.name in columns)

AttributeError: module 'bigframes.core.tools' has no attribute 'bigquery'
```
Looks like we're missing some imports too.
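This is the usual Python submodule gotcha: a submodule only becomes an attribute of its parent package once it has been imported somewhere. A sketch (assuming the `bigframes.core.tools.bigquery` module added on this branch):

```python
# A plain package import does NOT bind the submodule, which is what
# triggers the AttributeError above:
#
#     import bigframes.core.tools
#     bigframes.core.tools.bigquery  # AttributeError
#
# Importing the submodule explicitly binds it on the parent package:
import bigframes.core.tools.bigquery

print(bigframes.core.tools.bigquery)  # now resolves
```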
Tested the failing samples tests locally. I think my latest commits solve the issue of not being able to materialize `_TABLE_SUFFIX` as a column name.
The e2e and notebook failures are the same:

```
E       google.api_core.exceptions.BadRequest: 400 'FOR SYSTEM_TIME AS OF' expression for table 'bigframes-load-testing.bigframes_testing.penguins_dcdc3525965d3bf2805a055ee80a0ae7' evaluates to a TIMESTAMP value in the future: 2025-04-29 15:16:24.477885 UTC.; reason: invalidQuery, location: query, message: 'FOR SYSTEM_TIME AS OF' expression for table 'bigframes-load-testing.bigframes_testing.penguins_dcdc3525965d3bf2805a055ee80a0ae7' evaluates to a TIMESTAMP value in the future: 2025-04-29 15:16:24.477885 UTC.
```
I don't think these relate to this change, but I do recall BQML having a hard time with time travel. Potentially we're missing a force cache somewhere now?
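If a missing forced cache is the culprit, materializing before the BQML call would pin downstream SQL to a concrete table instead of a `FOR SYSTEM_TIME AS OF` snapshot whose timestamp can drift ahead of the server clock. A sketch, assuming the public `cache()` API (the internal fix may differ):

```python
import bigframes.pandas as bpd

df = bpd.read_gbq(
    "bigframes-load-testing.bigframes_testing.penguins_dcdc3525965d3bf2805a055ee80a0ae7"
)

# cache() materializes the frame to a session-managed table, so later
# queries reference that table directly, with no time-travel clause.
df = df.cache()
```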
Notebook failures appear to be flakes, as they are related to remote functions and succeeded in 3.10 but not 3.11.
```
nox > * notebook-3.10: success
nox > * notebook-3.11: failed
```
The e2e failure might indicate a real issue:

```
FAILED tests/system/large/operations/test_semantics.py::test_sim_join[has_score_column]

>       raise exceptions.from_http_response(response)
E       google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/bigframes-load-testing/queries/e2e2cede-621f-47fc-9fbc-4077c0579eaf?maxResults=0&location=US&prettyPrint=false: Duplicate column names in the result are not supported when a destination table is present. Found duplicate(s): creatures
```
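That error is consistent with results now being routed through a destination table: BigQuery tolerates duplicate names in an anonymous result set but not in a table schema. A minimal repro sketch (hypothetical tables and join; only the duplicated `creatures` name comes from the log above):

```python
from google.cloud import bigquery

client = bigquery.Client()

# This query succeeds without a destination table, but returns 400 with
# one, because the destination schema would need two `creatures` columns.
sql = """
    SELECT a.creatures, b.creatures
    FROM `my-project.my_dataset.t1` AS a
    JOIN `my-project.my_dataset.t2` AS b USING (id)
"""
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.dest",  # hypothetical
)
client.query(sql, job_config=job_config).result()  # raises BadRequest
```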
Different set of e2e failures this time:

```
FAILED tests/system/large/blob/test_function.py::test_blob_pdf_chunk[True-expected0]
FAILED tests/system/large/blob/test_function.py::test_blob_image_normalize_to_series
FAILED tests/system/large/blob/test_function.py::test_blob_image_blur_to_series
FAILED tests/system/large/blob/test_function.py::test_blob_image_normalize_to_folder
FAILED tests/system/large/blob/test_function.py::test_blob_image_blur_to_folder
FAILED tests/system/large/blob/test_function.py::test_blob_image_normalize_to_bq
FAILED tests/system/large/blob/test_function.py::test_blob_image_blur_to_bq
FAILED tests/system/large/blob/test_function.py::test_blob_image_resize_to_series
ERROR tests/system/large/blob/test_function.py::test_blob_pdf_extract[True-expected0]
```
Maybe it was just an LLM-related flake last time?
I'll split the defer part from the pseudocolumn work into https://github.com/googleapis/python-bigquery-dataframes/pull/1689. I think the defer step can be done without that, which makes this PR a lot smaller.