feat: add `write_engine` parameter to `read_FORMATNAME` methods to control how data is written to BigQuery
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)
Fixes internal issue 323176126 🦕
Blocked by https://github.com/googleapis/python-bigquery/issues/1815
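For context, a minimal sketch of how the new parameter is intended to be used (the engine names come from the test parametrization in the failure below; the GCS path here is just a placeholder):

```python
import bigframes.pandas as bpd

# write_engine controls how the data is written to BigQuery before the
# DataFrame is constructed. The values exercised by the system tests are
# "default", "bigquery_inline", "bigquery_load", and "bigquery_streaming".
df = bpd.read_csv(
    "gs://my-bucket/my-data.csv",  # placeholder path
    write_engine="bigquery_streaming",
)
```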
The test failure is a real one:
```
E       TypeError: Object of type bool_ is not JSON serializable

=================================== FAILURES ===================================
_____________ test_read_csv_gcs_default_engine[bigquery_streaming] _____________
[gw19] linux -- Python 3.11.6 /tmpfs/src/github/python-bigquery-dataframes/.nox/system-3-11/bin/python

session = <bigframes.session.Session object at 0x7f0d0f897cd0>
scalars_dfs = (         bool_col bytes_col ...
rowindex ...
... 2038-01-19 03:14:17.999999+00:00      8     False ... <NA>
[9 rows x 13 columns])
gcs_folder = 'gs://bigframes-dev-testing/bigframes_tests_system_20240229220731_1845bf/'
write_engine = 'bigquery_streaming'

    @skip_legacy_pandas
    @pytest.mark.parametrize(
        ("write_engine",),
        (
            ("default",),
            ("bigquery_inline",),
            ("bigquery_load",),
            ("bigquery_streaming",),
        ),
    )
    def test_read_csv_gcs_default_engine(session, scalars_dfs, gcs_folder, write_engine):
        scalars_df, _ = scalars_dfs
        if scalars_df.index.name is not None:
            path = gcs_folder + "test_read_csv_gcs_default_engine_w_index*.csv"
        else:
            path = gcs_folder + "test_read_csv_gcs_default_engine_wo_index*.csv"
        read_path = path.replace("*", FIRST_FILE)
        scalars_df.to_csv(path, index=False)
        dtype = scalars_df.dtypes.to_dict()
        dtype.pop("geography_col")
>       df = session.read_csv(
            read_path,
            # Convert default pandas dtypes to match BigQuery DataFrames dtypes.
            dtype=dtype,
            write_engine=write_engine,
        )

tests/system/small/test_session.py:435:

bigframes/session/__init__.py:1162: in read_csv
    return self._read_pandas(
bigframes/session/__init__.py:933: in _read_pandas
    return self._read_pandas_bigquery_table(
bigframes/session/__init__.py:988: in _read_pandas_bigquery_table
    table_expression = bigframes_io.pandas_to_bigquery_streaming(
bigframes/session/_io/bigquery.py:294: in pandas_to_bigquery_streaming
    for errors in bqclient.insert_rows_from_dataframe(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3662: in insert_rows_from_dataframe
    result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3605: in insert_rows
    return self.insert_rows_json(table, json_rows, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3801: in insert_rows_json
    response = self._call_api(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:827: in _call_api
    return call()
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:293: in retry_wrapped_func
    return retry_target(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:153: in retry_target
    _retry_error_helper(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_base.py:212: in _retry_error_helper
    raise final_exc from source_exc
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:144: in retry_target
    result = target()
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/_http/__init__.py:479: in api_request
    data = json.dumps(data)
/usr/local/lib/python3.11/json/__init__.py:231: in dumps
    return _default_encoder.encode(obj)
/usr/local/lib/python3.11/json/encoder.py:200: in encode
    chunks = self.iterencode(o, _one_shot=True)
/usr/local/lib/python3.11/json/encoder.py:258: in iterencode
    return _iterencode(o, 0)

self = <json.encoder.JSONEncoder object at 0x7f0d3d5e8b90>, o = True

    def default(self, o):
        """Implement this method in a subclass such that it returns
        a serializable object for ``o``, or calls the base implementation
        (to raise a ``TypeError``).

        For example, to support arbitrary iterators, you could
        implement default like this::

            def default(self, o):
                try:
                    iterable = iter(o)
                except TypeError:
                    pass
                else:
                    return list(iterable)
                # Let the base class default method raise the TypeError
                return JSONEncoder.default(self, o)

        """
>       raise TypeError(f'Object of type {o.__class__.__name__} '
                        f'is not JSON serializable')
E       TypeError: Object of type bool_ is not JSON serializable

/usr/local/lib/python3.11/json/encoder.py:180: TypeError
----------------------------- Captured stdout call -----------------------------
Query job 2b56fdd1-5dcc-4afd-8521-69b517d5afae is DONE. 1.1 kB processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:2b56fdd1-5dcc-4afd-8521-69b517d5afae&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is RUNNING.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is DONE. 0 Bytes processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
=============================== warnings summary ===============================
```
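The root cause boils down to the standard-library JSON encoder not accepting numpy scalar types, which is easy to reproduce outside of BigQuery DataFrames:

```python
import json

import numpy

# numpy.bool_ is not a subclass of the built-in bool, so json.dumps rejects
# it with the same TypeError shown in the traceback above.
json.dumps({"bool_col": numpy.bool_(True)})
# TypeError: Object of type bool_ is not JSON serializable
```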
This will likely require a fix upstream in google-cloud-bigquery, but in the meantime I can make sure to use a vendored version of `insert_rows_from_dataframe` that can serialize a numpy `bool_` value, similar to how there's already a special case for NaN.
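As a rough sketch of what that vendored special case could look like (the helper names below are illustrative only, not actual google-cloud-bigquery API):

```python
import math
from typing import Any, Dict, List

import numpy
import pandas


def _to_json_serializable(value: Any) -> Any:
    """Coerce numpy scalars (and NaN) into values json.dumps can handle."""
    if isinstance(value, numpy.bool_):
        return bool(value)
    if isinstance(value, numpy.integer):
        return int(value)
    if isinstance(value, numpy.floating):
        value = float(value)
    if isinstance(value, float) and math.isnan(value):
        # Mirror the existing NaN special case: send NULL instead of NaN.
        return None
    return value


def _dataframe_to_json_rows(dataframe: pandas.DataFrame) -> List[Dict[str, Any]]:
    """Build JSON-compatible row dicts to pass to insert_rows_json."""
    return [
        {column: _to_json_serializable(value) for column, value in record.items()}
        for record in dataframe.to_dict(orient="records")
    ]
```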
Edit: I already fixed this in https://github.com/googleapis/python-bigquery/pull/1816, waiting on version 3.18.0.
- [x] Bump the minimum google-cloud-bigquery version after merging https://github.com/googleapis/python-bigquery/pull/1817
Marking as do not merge for now. Thanks for your feedback so far. I will wait until we implement go/pandas-gbq-and-bigframes-redundancy before merging this (likely mid-April).
I've mailed https://github.com/googleapis/python-bigquery-pandas/pull/814 as a first step of this project. I plan to follow up that PR with one that copies the `bigquery_streaming` write engine from this PR to pandas-gbq, so that the pandas -> BigQuery logic can be consolidated in the pandas-gbq package.