feat: add `GroupBy.__iter__`

Open tswast opened this issue 10 months ago • 1 comments

Note: this is a work in progress. We have two choices for the interface, and I find myself flip-flopping between the two:

  1. Return an iterable of pandas objects, similar to `to_pandas_batches()`. To make sure all the rows of a group arrive together, this would mean (a) creating a struct of all non-grouped fields, (b) applying `array_agg` to those structs, and (c) unpacking those arrays and structs into DataFrame objects locally.
  2. Return an iterable of bigframes objects, each filtered to the rows belonging to the corresponding group. This would involve running a query, iterating through the results to get all the key values, and then, for each key value, returning a DataFrame (or Series) with the corresponding filter attached.
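
To make the trade-off concrete, here is a rough sketch of both options in plain Python. It is not the bigframes implementation: a list of dicts stands in for the table, and `ROWS`, `iter_groups_materialized`, `iter_groups_filtered`, and `_filter_rows` are hypothetical names used only for illustration. Keys are sorted on the assumption that iteration should follow the sorted-key order pandas uses by default.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List, Tuple

Row = Dict[str, Any]

# Toy stand-in for a table; in the real feature the rows would live in BigQuery.
ROWS: List[Row] = [
    {"team": "a", "player": "x", "score": 1},
    {"team": "b", "player": "y", "score": 2},
    {"team": "a", "player": "z", "score": 3},
]


def iter_groups_materialized(
    rows: List[Row], key: str
) -> Iterator[Tuple[Any, List[Row]]]:
    """Option 1: aggregate every group up front, then unpack locally."""
    aggregated: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        # (a) build a "struct" of all non-grouped fields
        struct = {name: value for name, value in row.items() if name != key}
        # (b) collect the structs per key, like array_agg would
        aggregated[row[key]].append(struct)
    for key_value in sorted(aggregated):
        # (c) unpack the array of structs back into full local rows
        yield key_value, [{key: key_value, **s} for s in aggregated[key_value]]


def _filter_rows(rows: List[Row], key: str, key_value: Any) -> Iterator[Row]:
    """Lazy filter; nothing is scanned until the caller iterates."""
    return (row for row in rows if row[key] == key_value)


def iter_groups_filtered(
    rows: List[Row], key: str
) -> Iterator[Tuple[Any, Iterator[Row]]]:
    """Option 2: fetch the distinct keys, then attach one filter per key."""
    key_values = sorted({row[key] for row in rows})  # the "key query"
    for key_value in key_values:
        yield key_value, _filter_rows(rows, key, key_value)


if __name__ == "__main__":
    for name, group in iter_groups_materialized(ROWS, "team"):
        print("option 1:", name, group)
    for name, lazy_group in iter_groups_filtered(ROWS, "team"):
        print("option 2:", name, list(lazy_group))
```

In this sketch, option 1 does all the work in one pass and hands back fully local data, while option 2 does one cheap pass for the keys and defers the per-group work until each group is actually consumed.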

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • [ ] Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • [ ] Ensure the tests and linter pass
  • [ ] Code coverage does not decrease (if any source code was changed)
  • [ ] Appropriate docs were updated (if necessary)

Fixes internal bug 383638782 🦕

tswast avatar Feb 13 '25 19:02 tswast

e2e failures:

FAILED tests/system/large/blob/test_function.py::test_blob_image_resize_to_series
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_df_where_mask_series
FAILED tests/system/large/blob/test_function.py::test_blob_pdf_chunk[True] - ...
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_with_connection[bq_connection]
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_series_apply_array_output
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_df_apply_axis_1_aggregates
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_array_output
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_dataframe_apply_axis_1
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_df_where_mask
FAILED tests/system/large/blob/test_function.py::test_blob_image_resize_to_folder
FAILED tests/system/large/blob/test_function.py::test_blob_pdf_chunk[False]
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_options
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_series_combine
FAILED tests/system/large/blob/test_function.py::test_blob_pdf_extract[False]
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_dataframe_apply_axis_1_array_output
FAILED tests/system/large/blob/test_function.py::test_blob_image_blur_to_folder
FAILED tests/system/large/functions/test_managed_function.py::test_managed_function_series_apply_args

These appear unrelated to this change.

tswast avatar Sep 18 '25 21:09 tswast

doctest failure:

__ [doctest] third_party.bigframes_vendored.pandas.core.frame.DataFrame.join ___
[gw4] linux -- Python 3.12.7 /tmpfs/src/github/python-bigquery-dataframes/.nox/doctest/bin/python
4666             >>> df1.join(df2, how="inner")
4667                col1  col2 col3  col4
4668             11  bar     2  foo     3
4669             <BLANKLINE>
4670             [1 rows x 4 columns]
4671 
4672 
4673         Another option to join using the key columns is to use the on parameter:
4674 
4675             >>> df1.join(df2, on="col1", how="right")
UNEXPECTED EXCEPTION: TypeError('Cannot coerce string and Int64 to a common type.')
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/doctest.py", line 1368, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest third_party.bigframes_vendored.pandas.core.frame.DataFrame.join[11]>", line 1, in <module>
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 195, in wrapper
    raise e
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 180, in wrapper
    return method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/dataframe.py", line 3694, in join
    return self._join_on_key(
           ^^^^^^^^^^^^^^^^^^
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 195, in wrapper
    raise e
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 180, in wrapper
    return method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/dataframe.py", line 3756, in _join_on_key
    combined_df = left._perform_join_by_index(right, how=how)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 195, in wrapper
    raise e
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/log_adapter.py", line 180, in wrapper
    return method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/dataframe.py", line 3786, in _perform_join_by_index
    block, _ = self._block.join(
               ^^^^^^^^^^^^^^^^^
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/blocks.py", line 2584, in join
    return join_mono_indexed(
           ^^^^^^^^^^^^^^^^^^
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/blocks.py", line 3066, in join_mono_indexed
    combined_expr, (get_column_left, get_column_right) = left_expr.relational_join(
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/core/array_value.py", line 486, in relational_join
    if not bigframes.dtypes.can_compare(ltype, rtype):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/dtypes.py", line 362, in can_compare
    coerced_type = coerce_to_common(type1, type2)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmpfs/src/github/python-bigquery-dataframes/bigframes/dtypes.py", line 896, in coerce_to_common
    raise TypeError(f"Cannot coerce {etype1} and {etype2} to a common type.")
TypeError: Cannot coerce string and Int64 to a common type.
/tmpfs/src/github/python-bigquery-dataframes/third_party/bigframes_vendored/pandas/core/frame.py:4675: UnexpectedException

This seems unrelated to the current change.

tswast avatar Sep 18 '25 21:09 tswast