python-bigquery-dataframes
feat: add `GroupBy.__iter__`
Note: this is a work in progress. We have two choices for the interface, and I find myself flip-flopping between the two:
- Return an iterable of pandas objects, similar to `to_pandas_batches()`. To make sure we end up with all the rows of a group together, this would mean (a) creating a struct of all non-grouped fields, (b) `array_agg`-ing those structs, and (c) unpacking those arrays and structs into DataFrame objects locally.
- Return an iterable of bigframes objects, each filtered to match the rows belonging to the corresponding group. This would involve running a query and iterating through the results to get all the key values, then, for each key value, returning a DataFrame (or Series) with the corresponding filter attached.
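
To make the trade-off concrete, here is a rough sketch of both strategies in plain Python, with lists of dicts standing in for tables. The function names `iter_groups_by_array_agg` and `iter_groups_by_filter` are hypothetical and are not part of bigframes; the real implementation would push the aggregation or filter into BigQuery rather than loop locally.

```python
from typing import Any, Iterator

Row = dict[str, Any]


def iter_groups_by_array_agg(rows: list[Row], key: str) -> Iterator[tuple[Any, list[Row]]]:
    """Option 1 (hypothetical sketch): aggregate structs per key, then unpack."""
    aggregated: dict[Any, list[Row]] = {}
    for row in rows:
        # (a) pack every non-grouped field into a struct
        struct = {name: value for name, value in row.items() if name != key}
        # (b) collect the structs per key, mimicking array_agg
        aggregated.setdefault(row[key], []).append(struct)
    for key_value, structs in aggregated.items():
        # (c) unpack the array of structs back into full rows locally
        yield key_value, [{key: key_value, **struct} for struct in structs]


def iter_groups_by_filter(rows: list[Row], key: str) -> Iterator[tuple[Any, list[Row]]]:
    """Option 2 (hypothetical sketch): find the key values, then filter per key."""
    # One pass to collect the distinct key values, in first-seen order.
    key_values = list(dict.fromkeys(row[key] for row in rows))
    for key_value in key_values:
        # One filtered "view" of the original table per key value.
        yield key_value, [row for row in rows if row[key] == key_value]


rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]
for key_value, group in iter_groups_by_filter(rows, "k"):
    print(key_value, group)
```

Both functions yield the same groups for this input. The difference is where the work happens: the first does a single aggregation and materializes everything locally, while the second does one key lookup plus one filter per group and could keep each group as a deferred object.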
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)
Fixes internal bug 383638782 🦕
e2e failures:
FAILED tests/system/large/blob/test_function.py::test_blob_image_resize_to_series
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_df_where_mask_series
FAILED tests/system/large/blob/test_function.py::test_blob_pdf_chunk[True] - ...
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_with_connection[bq_connection]
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_series_apply_array_output
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_df_apply_axis_1_aggregates
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_array_output
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_dataframe_apply_axis_1
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_df_where_mask
FAILED tests/system/large/blob/test_function.py::test_blob_image_resize_to_folder
FAILED tests/system/large/blob/test_function.py::test_blob_pdf_chunk[False]
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_options
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_series_combine
FAILED tests/system/large/blob/test_function.py::test_blob_pdf_extract[False]
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_dataframe_apply_axis_1_array_output
FAILED tests/system/large/blob/test_function.py::test_blob_image_blur_to_folder
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_series_apply_args
These appear unrelated to this change.
doctest failure:
__ [doctest] third_party.bigframes_vendored.pandas.core.frame.DataFrame.join ___
[gw4] linux -- Python 3.12.7 /tmpfs/src/github/python-bigquery-dataframes/.nox/doctest/bin/python
4666 >>> df1.join(df2, how="inner")
4667 col1 col2 col3 col4
4668 11 bar 2 foo 3
4669 <BLANKLINE>
4670 [1 rows x 4 columns]
4671
4672
4673 Another option to join using the key columns is to use the on parameter:
4674
4675 >>> df1.join(df2, on="col1", how="right")
UNEXPECTED EXCEPTION: TypeError('Cannot coerce string and Int64 to a common type.')
Traceback (most recent call last):
File "/usr/local/lib/python3.12/doctest.py", line 1368, in __run
exec(compile(example.source, filename, "single",
File "<doctest third_party.bigframes_vendored.pandas.core.frame.DataFrame.join[11]>", line 1, in <module>
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 195, in wrapper
raise e
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 180, in wrapper
return method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/dataframe.py", line 3694, in join
return self._join_on_key(
^^^^^^^^^^^^^^^^^^
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 195, in wrapper
raise e
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 180, in wrapper
return method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/dataframe.py", line 3756, in _join_on_key
combined_df = left._perform_join_by_index(right, how=how)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 195, in wrapper
raise e
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 180, in wrapper
return method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/dataframe.py", line 3786, in _perform_join_by_index
block, _ = self._block.join(
^^^^^^^^^^^^^^^^^
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/blocks.py", line 2584, in join
return join_mono_indexed(
^^^^^^^^^^^^^^^^^^
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/blocks.py", line 3066, in join_mono_indexed
combined_expr, (get_column_left, get_column_right) = left_expr.relational_join(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/array_value.py", line 486, in relational_join
if not bigframes.dtypes.can_compare(ltype, rtype):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/dtypes.py", line 362, in can_compare
coerced_type = coerce_to_common(type1, type2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/dtypes.py", line 896, in coerce_to_common
raise TypeError(f"Cannot coerce {etype1} and {etype2} to a common type.")
TypeError: Cannot coerce string and Int64 to a common type.
/tmpfs/src/github/python-bigquery-dataframes/third_party/bigframes_vendored/pandas/core/frame.py:4675: UnexpectedException
This seems unrelated to the current change.