perf: defer query in `read_gbq` with wildcard tables
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)
Fixes internal issue 405773140 🦕
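For context, a sketch of the behavior this PR targets (hypothetical usage; the exact deferral mechanics are internal to the change): reading a wildcard table should not eagerly execute a query, and the `_TABLE_SUFFIX` pseudocolumn should stay available for filtering.

```python
import bigframes.pandas as bpd

# Wildcard table read; with this change the scan should be deferred
# rather than executed eagerly at read_gbq time.
df = bpd.read_gbq("bigquery-public-data.noaa_gsod.gsod*")

# Assumed behavior from this PR's pseudocolumn work: _TABLE_SUFFIX is
# exposed as a filterable column.
df = df[df["_TABLE_SUFFIX"] >= "1990"]
```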
Failures look like real failures:

```
    if not 200 <= response.status_code < 300:
>       raise exceptions.from_http_response(response)
E       google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/python-docs-samples-tests/queries/32f42306-e95f-48bc-a2fb-56761aec5476?maxResults=0&location=US&prettyPrint=false: Invalid field name "_TABLE_SUFFIX". Field names are not allowed to start with the (case-insensitive) prefixes _PARTITION, _TABLE_, _FILE_, _ROW_TIMESTAMP, __ROOT__ and _COLIDENTIFIER
E
E       Location: US
E       Job ID: 32f42306-e95f-48bc-a2fb-56761aec5476
```
While we can query such fields, it looks like we can't materialize them.
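One workaround is to alias the pseudocolumn before results land in a destination table, since the reserved-prefix restriction applies to output field names rather than to the query itself. A minimal sketch with the plain BigQuery client (illustrative destination name; not necessarily what this PR does):

```python
from google.cloud import bigquery

client = bigquery.Client()

# `SELECT *` on a wildcard table excludes _TABLE_SUFFIX, so selecting it
# explicitly under a non-reserved alias lets the result be written to a
# destination table without tripping the "Invalid field name" error.
sql = """
    SELECT *, _TABLE_SUFFIX AS table_suffix
    FROM `bigquery-public-data.noaa_gsod.gsod*`
"""
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.materialized_wildcard",  # hypothetical
)
client.query(sql, job_config=job_config).result()
```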
Added "do not merge". Need to make sure this is compatible with `to_gbq()` and `cached()`.
From the notebook tests:
```
File /tmpfs/src/github/python-bigquery-dataframes/bigframes/core/nodes.py:711, in GbqTable.from_table(table, columns)
    708 @staticmethod
    709 def from_table(table: bq.Table, columns: Sequence[str] = ()) -> GbqTable:
    710     # Subsetting fields with columns can reduce cost of row-hash default ordering
--> 711     table_schema = bigframes.core.tools.bigquery.get_schema_and_pseudocolumns(table)
    713     if columns:
    714         schema = tuple(item for item in table_schema if item.name in columns)

AttributeError: module 'bigframes.core.tools' has no attribute 'bigquery'
```
Looks like we're missing some imports too.
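This is the usual Python submodule gotcha: a submodule only becomes an attribute of its parent package once it has been imported somewhere. A sketch (assuming the `bigframes.core.tools.bigquery` module added on this branch):

```python
# A plain package import does NOT bind the submodule, which is what
# triggers the AttributeError above:
#
#     import bigframes.core.tools
#     bigframes.core.tools.bigquery  # AttributeError
#
# Importing the submodule explicitly binds it on the parent package:
import bigframes.core.tools.bigquery

print(bigframes.core.tools.bigquery)  # now resolves
```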
Tested the failing samples tests locally. I think my latest commits solve the issue of not being able to materialize `_TABLE_SUFFIX` as a column name.
The e2e and notebook failures are the same:

```
E       google.api_core.exceptions.BadRequest: 400 'FOR SYSTEM_TIME AS OF' expression for table 'bigframes-load-testing.bigframes_testing.penguins_dcdc3525965d3bf2805a055ee80a0ae7' evaluates to a TIMESTAMP value in the future: 2025-04-29 15:16:24.477885 UTC.; reason: invalidQuery, location: query, message: 'FOR SYSTEM_TIME AS OF' expression for table 'bigframes-load-testing.bigframes_testing.penguins_dcdc3525965d3bf2805a055ee80a0ae7' evaluates to a TIMESTAMP value in the future: 2025-04-29 15:16:24.477885 UTC.
```
I don't think these relate to this change, but I do recall BQML having a hard time with time travel. Potentially we're missing a force cache somewhere now?
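If a missing forced cache is the culprit, materializing before the BQML call would pin downstream SQL to a concrete table instead of a `FOR SYSTEM_TIME AS OF` snapshot whose timestamp can drift ahead of the server clock. A sketch, assuming the public `cache()` API (the internal fix may differ):

```python
import bigframes.pandas as bpd

df = bpd.read_gbq(
    "bigframes-load-testing.bigframes_testing.penguins_dcdc3525965d3bf2805a055ee80a0ae7"
)

# cache() materializes the frame to a session-managed table, so later
# queries reference that table directly, with no time-travel clause.
df = df.cache()
```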
Notebook failures appear to be flakes, as they are related to remote functions and succeeded in 3.10 but not 3.11.
```
nox > * notebook-3.10: success
nox > * notebook-3.11: failed
```
The e2e failure might indicate a real issue:

```
FAILED tests/system/large/operations/test_semantics.py::test_sim_join[has_score_column]

>       raise exceptions.from_http_response(response)
E       google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/bigframes-load-testing/queries/e2e2cede-621f-47fc-9fbc-4077c0579eaf?maxResults=0&location=US&prettyPrint=false: Duplicate column names in the result are not supported when a destination table is present. Found duplicate(s): creatures
```
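That error is consistent with results now being routed through a destination table: BigQuery tolerates duplicate names in an anonymous result set but not in a table schema. A minimal repro sketch (hypothetical tables and join; only the duplicated `creatures` name comes from the log above):

```python
from google.cloud import bigquery

client = bigquery.Client()

# This query succeeds without a destination table, but returns 400 with
# one, because the destination schema would need two `creatures` columns.
sql = """
    SELECT a.creatures, b.creatures
    FROM `my-project.my_dataset.t1` AS a
    JOIN `my-project.my_dataset.t2` AS b USING (id)
"""
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.dest",  # hypothetical
)
client.query(sql, job_config=job_config).result()  # raises BadRequest
```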
Different set of e2e failures this time:

```
FAILED tests/system/large/blob/test_function.py::test_blob_pdf_chunk[True-expected0]
FAILED tests/system/large/blob/test_function.py::test_blob_image_normalize_to_series
FAILED tests/system/large/blob/test_function.py::test_blob_image_blur_to_series
FAILED tests/system/large/blob/test_function.py::test_blob_image_normalize_to_folder
FAILED tests/system/large/blob/test_function.py::test_blob_image_blur_to_folder
FAILED tests/system/large/blob/test_function.py::test_blob_image_normalize_to_bq
FAILED tests/system/large/blob/test_function.py::test_blob_image_blur_to_bq
FAILED tests/system/large/blob/test_function.py::test_blob_image_resize_to_series
ERROR tests/system/large/blob/test_function.py::test_blob_pdf_extract[True-expected0]
```
Maybe it was just an LLM-related flake last time?
I'll split the defer part from the pseudocolumn work into https://github.com/googleapis/python-bigquery-dataframes/pull/1689. I think the defer step can be done without that, which makes this PR a lot smaller.