
[Bug]: Some methods on DeferredSeries and DeferredDataFrame don't work right when returning single items

Open caneff opened this issue 2 years ago • 1 comments

What happened?

I've been upgrading dataframe to be compliant with Pandas 2, and a new doctest in Pandas 2 revealed this bug that currently exists in Pandas 1.

Whenever .xs() would yield a single item, it either fails or returns a result that differs from what pandas xs returns. For examples, I have a draft PR with failing tests here

There are two different types of failures. The first is where pandas drops dimensionality (returning a scalar for ser.xs() or a series for df.xs()) because we pick exactly one element from a single-Index series/dataframe. We then get concatenation failures when trying to combine everything back together, because we expect to concat series/frame elements (for ser.xs/frame.xs respectively) but instead receive a scalar/series, since pandas has done the dimension reduction.
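The dimension reduction can be reproduced in plain pandas, without Beam. This is a minimal sketch: the small frame is a cut-down version of the test data, and the use of `pd.concat` is an assumption standing in for however the partial results are actually combined.

```python
import pandas as pd

# Single-level index with a repeated label ('mammal') and a unique one ('bird').
df = pd.DataFrame(
    {"num_legs": [4, 4, 2, 2], "num_wings": [0, 0, 2, 2]},
    index=pd.Index(["mammal", "mammal", "mammal", "bird"], name="class"),
)
ser = df["num_legs"]

many = ser.xs("mammal")  # several matches: stays a Series
one = ser.xs("bird")     # exactly one match: reduced to a scalar
row = df.xs("bird")      # exactly one match: reduced to a Series

print(type(many).__name__, type(row).__name__, isinstance(one, pd.Series))

# pd.concat only accepts Series and DataFrame objects, so combining
# pieces that were reduced to scalars is rejected with a TypeError.
try:
    pd.concat([one, one])
    concat_failed = False
except TypeError:
    concat_failed = True
print(concat_failed)
```

The result type depends on how many rows match the key, which is only known once the data has been seen.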

The other type of failure is with a MultiIndex, where we call df.xs() with a tuple whose length equals the number of levels of the MultiIndex, so only a single item is returned. We still get a df, but it looks wrong and has the wrong shape.
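For comparison, this is what plain pandas returns for a full-length key. It is a sketch using the same data as the test in the trace; it shows only the pandas side, not the Beam output.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "num_legs": [4, 4, 2, 2],
        "num_wings": [0, 0, 2, 2],
        "class": ["mammal", "mammal", "mammal", "bird"],
        "animal": ["cat", "dog", "bat", "penguin"],
        "locomotion": ["walks", "walks", "flies", "walks"],
    }
).set_index(["class", "animal", "locomotion"])

full_key = ("mammal", "dog", "walks")  # one entry per index level
row = df.xs(full_key)                  # pandas returns a Series here
value = df["num_legs"].xs(full_key)    # and a scalar for a Series

# The Series is indexed by the original column names and named after the key.
print(type(row).__name__, list(row.index), row.name)
print(value)
```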

Examples of both types of failure are in the draft PR linked above.

For the series case, when it fails, the stack trace is:

_____________________________________________________________________ DeferredFrameTest.test_series_xs ______________________________________________________________________

self = <apache_beam.dataframe.frames_test.DeferredFrameTest testMethod=test_series_xs>

    def test_series_xs(self):
      # pandas doctests only verify DataFrame.xs, here we verify Series.xs as well
      d = {
          'num_legs': [4, 4, 2, 2],
          'num_wings': [0, 0, 2, 2],
          'class': ['mammal', 'mammal', 'mammal', 'bird'],
          'animal': ['cat', 'dog', 'bat', 'penguin'],
          'locomotion': ['walks', 'walks', 'flies', 'walks']
      }
      df = pd.DataFrame(data=d)
      df = df.set_index(['class', 'animal', 'locomotion'])
    
      self._run_test(lambda df: df.num_legs.xs('mammal'), df)
      self._run_test(lambda df: df.num_legs.xs(('mammal', 'dog')), df)
      self._run_test(lambda df: df.num_legs.xs('cat', level=1), df)
      self._run_test(
          lambda df: df.num_legs.xs(('bird', 'walks'), level=[0, 'locomotion']),
          df)
    
      df_single_index = df.reset_index().set_index('class')
    
      # Passes because Pandas xs returns a series (multiple matches for 'mammal')
>     self._run_test(lambda df: df.num_legs.xs('mammal'), df_single_index)

frames_test.py:325: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
frames_test.py:191: in _run_test
    expected = expected.sort_values(list(expected.columns))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = class
mammal    4
mammal    4
mammal    2
Name: num_legs, dtype: int64, name = 'columns'

    def __getattr__(self, name: str):
        """
        After regular attribute access, try looking up the name
        This allows simpler access to columns for interactive use.
        """
        # Note: obj.x will always call obj.__getattribute__('x') prior to
        # calling obj.__getattr__('x').
        if (
            name not in self._internal_names_set
            and name not in self._metadata
            and name not in self._accessors
            and self._info_axis._can_hold_identifiers_and_holds_name(name)
        ):
            return self[name]
>       return object.__getattribute__(self, name)
E       AttributeError: 'Series' object has no attribute 'columns'

../../../../../../.virtualenvs/env/lib/python3.11/site-packages/pandas/core/generic.py:5902: AttributeError

And for the frame case, when it fails (as opposed to just returning the wrong thing) the stack trace is:

____________________________________________________________________ DeferredFrameTest.test_dataframe_xs ____________________________________________________________________

self = <apache_beam.dataframe.frames_test.DeferredFrameTest testMethod=test_dataframe_xs>

    def test_dataframe_xs(self):
      # Test cases reported in BEAM-13421
      df = pd.DataFrame(
          np.array([
              ['state', 'day1', 12],
              ['state', 'day1', 1],
              ['state', 'day2', 14],
              ['county', 'day1', 9],
          ]),
          columns=['provider', 'time', 'value'])
    
      # Passes because Pandas xs returns a frame (multiple matches for 'state')
      self._run_test(lambda df: df.xs('state'), df.set_index(['provider']))
      # Fails because Pandas xs returns a series (single match for 'county')
>     self._run_test(lambda df: df.xs('county'), df.set_index(['provider']))

frames_test.py:343: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
frames_test.py:198: in _run_test
    pd.testing.assert_series_equal(expected, actual)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

left = time     day1
value       9
Name: county, dtype: object, right =           0 time value
time   day1  NaN   NaN
value     9  NaN   NaN
cls = <class 'pandas.core.series.Series'>

    def _check_isinstance(left, right, cls):
        """
        Helper method for our assert_* methods that ensures that
        the two objects being compared have the right type before
        proceeding with the comparison.
    
        Parameters
        ----------
        left : The first object being compared.
        right : The second object being compared.
        cls : The class type to check against.
    
        Raises
        ------
        AssertionError : Either `left` or `right` is not an instance of `cls`.
        """
        cls_name = cls.__name__
    
        if not isinstance(left, cls):
            raise AssertionError(
                f"{cls_name} Expected type {cls}, found {type(left)} instead"
            )
        if not isinstance(right, cls):
>           raise AssertionError(
                f"{cls_name} Expected type {cls}, found {type(right)} instead"
            )
E           AssertionError: Series Expected type <class 'pandas.core.series.Series'>, found <class 'pandas.core.frame.DataFrame'> instead

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • [X] Component: Python SDK
  • [ ] Component: Java SDK
  • [ ] Component: Go SDK
  • [ ] Component: Typescript SDK
  • [ ] Component: IO connector
  • [ ] Component: Beam examples
  • [ ] Component: Beam playground
  • [ ] Component: Beam katas
  • [ ] Component: Website
  • [ ] Component: Spark Runner
  • [ ] Component: Flink Runner
  • [ ] Component: Samza Runner
  • [ ] Component: Twister2 Runner
  • [ ] Component: Hazelcast Jet Runner
  • [ ] Component: Google Cloud Dataflow Runner

caneff, Sep 20 '23 15:09

Seeing a similar error on a different operation:

from apache_beam.dataframe.doctests import teststring

test = """
   With a DataFrame

            >>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]},
            ...                   index=['cat', 'dog', 'dog', 'mouse'])
            >>> df
                   a   b
              cat  1   3
              dog  2   4
              dog  2   4
            mouse  3   4
            >>> df.kurt()
            a   1.5
            b   4.0
            dtype: float64

            Passes on Beam with axis=1

            >>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [3, 4], 'd': [1, 2]},
            ...                   index=['cat', 'dog'])
            >>> df.kurt(axis=1)
            cat   -6.0
            dog   -6.0
            dtype: float64

            Fails on Beam with axis=None

            >>> df.kurt(axis=None).round(6)
            -0.988693

"""
teststring(test)
    AttributeError: '_DeferredScalar' object has no attribute 'round'
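For reference, a plain-pandas sketch of the same single-item pattern, using the frame from the doctest above. It deliberately avoids `axis=None`, since that form depends on the pandas version; the per-column and per-Series calls are enough to show where a reduction stops returning a Series and returns a scalar instead.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 2, 3], "b": [3, 4, 4, 4]},
    index=["cat", "dog", "dog", "mouse"],
)

per_column = df.kurt()   # one value per column: a Series
single = df["a"].kurt()  # one value overall: a scalar, not a Series

print(type(per_column).__name__, isinstance(single, pd.Series))

# Builtin round() works on any real-number scalar and does not rely on
# a .round method being present on the result object.
print(round(float(single), 6))
```

In the failing doctest, `.round(6)` is called on the scalar result itself, so it only works when the returned object provides that method.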

tvalentyn, May 01 '24 00:05