[Bug]: Some methods on DeferredSeries and DeferredDataFrame don't work right when returning single items
What happened?
I've been upgrading the Beam DataFrame API to be compatible with pandas 2, and a new doctest in pandas 2 revealed this bug, which already exists with pandas 1.
Whenever .xs() yields a single item, it either fails or returns a different result than pandas' xs. For examples, I have a draft PR with failing tests here
There are two different types of failures. The first is where pandas drops a dimension (returning a scalar for ser.xs() or a series for df.xs()) because the key picks exactly one element of a series/dataframe with a single-level Index. We then get concatenation failures when trying to combine the partial results, because we expect to concat series/frame elements (for ser.xs/frame.xs respectively) but instead receive a scalar/series, since pandas has already done the dimension reduction.
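For illustration, here is a minimal pandas-only sketch of that dimension reduction. It does not use Beam at all; the data mirrors test_dataframe_xs, except that the values are plain integers (an assumption made here for readability), and the final concat is only a stand-in for whatever Beam does when it combines partial results.

```python
import pandas as pd

# Same layout as the data in test_dataframe_xs, but with integer values
# (an assumption for readability; the test builds strings via np.array).
df = pd.DataFrame({
    'provider': ['state', 'state', 'state', 'county'],
    'time': ['day1', 'day1', 'day2', 'day1'],
    'value': [12, 1, 14, 9],
}).set_index('provider')

frame_many = df.xs('state')    # several matches: stays a DataFrame
frame_one = df.xs('county')    # exactly one match: reduced to a Series

series_many = df['value'].xs('state')    # several matches: stays a Series
series_one = df['value'].xs('county')    # exactly one match: reduced to a scalar

for result in (frame_many, frame_one, series_many, series_one):
    print(type(result).__name__)

# pd.concat only accepts Series/DataFrame objects, so a scalar result
# cannot be concatenated with other partial results.
try:
    pd.concat([series_one])
    concat_error = None
except TypeError as exc:
    concat_error = exc
print(concat_error)
```

So the return type of xs depends on how many rows match the key, which is not known until the data is actually evaluated.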
The other type of failure is with a MultiIndex, where we call df.xs() with a tuple whose length equals the number of levels of the MultiIndex, so only a single item is returned. We still get a df, but it looks wrong and has the wrong shape.
Examples of both failure types are in the draft PR linked above.
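As a reference point for what pandas itself returns in the MultiIndex case, here is a small pandas-only sketch using the same data as test_series_xs. The variable names `partial` and `full` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'num_legs': [4, 4, 2, 2],
    'num_wings': [0, 0, 2, 2],
    'class': ['mammal', 'mammal', 'mammal', 'bird'],
    'animal': ['cat', 'dog', 'bat', 'penguin'],
    'locomotion': ['walks', 'walks', 'flies', 'walks'],
}).set_index(['class', 'animal', 'locomotion'])

# Key covers only the first level: several rows match, so the result is
# a DataFrame indexed by the remaining levels.
partial = df.xs('mammal')

# Key covers all three levels and matches one row: pandas returns that
# row as a Series indexed by the column names.
full = df.xs(('mammal', 'dog', 'walks'))

print(type(partial).__name__, partial.shape)
print(type(full).__name__, list(full.index))
```

With a full-length key, the expected result is therefore a Series, not a one-row frame.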
For the series case, the stack trace on failure is:
_____________________________________________________________________ DeferredFrameTest.test_series_xs ______________________________________________________________________
self = <apache_beam.dataframe.frames_test.DeferredFrameTest testMethod=test_series_xs>
def test_series_xs(self):
# pandas doctests only verify DataFrame.xs, here we verify Series.xs as well
d = {
'num_legs': [4, 4, 2, 2],
'num_wings': [0, 0, 2, 2],
'class': ['mammal', 'mammal', 'mammal', 'bird'],
'animal': ['cat', 'dog', 'bat', 'penguin'],
'locomotion': ['walks', 'walks', 'flies', 'walks']
}
df = pd.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])
self._run_test(lambda df: df.num_legs.xs('mammal'), df)
self._run_test(lambda df: df.num_legs.xs(('mammal', 'dog')), df)
self._run_test(lambda df: df.num_legs.xs('cat', level=1), df)
self._run_test(
lambda df: df.num_legs.xs(('bird', 'walks'), level=[0, 'locomotion']),
df)
df_single_index = df.reset_index().set_index('class')
# Passes because Pandas xs returns a series (multiple matches for 'mammal')
> self._run_test(lambda df: df.num_legs.xs('mammal'), df_single_index)
frames_test.py:325:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
frames_test.py:191: in _run_test
expected = expected.sort_values(list(expected.columns))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = class
mammal 4
mammal 4
mammal 2
Name: num_legs, dtype: int64, name = 'columns'
def __getattr__(self, name: str):
"""
After regular attribute access, try looking up the name
This allows simpler access to columns for interactive use.
"""
# Note: obj.x will always call obj.__getattribute__('x') prior to
# calling obj.__getattr__('x').
if (
name not in self._internal_names_set
and name not in self._metadata
and name not in self._accessors
and self._info_axis._can_hold_identifiers_and_holds_name(name)
):
return self[name]
> return object.__getattribute__(self, name)
E AttributeError: 'Series' object has no attribute 'columns'
../../../../../../.virtualenvs/env/lib/python3.11/site-packages/pandas/core/generic.py:5902: AttributeError
And for the frame case, when it fails outright (as opposed to just returning the wrong thing), the stack trace is:
____________________________________________________________________ DeferredFrameTest.test_dataframe_xs ____________________________________________________________________
self = <apache_beam.dataframe.frames_test.DeferredFrameTest testMethod=test_dataframe_xs>
def test_dataframe_xs(self):
# Test cases reported in BEAM-13421
df = pd.DataFrame(
np.array([
['state', 'day1', 12],
['state', 'day1', 1],
['state', 'day2', 14],
['county', 'day1', 9],
]),
columns=['provider', 'time', 'value'])
# Passes because Pandas xs returns a frame (multiple matches for 'state')
self._run_test(lambda df: df.xs('state'), df.set_index(['provider']))
# Fails because Pandas xs returns a series (single match for 'county')
> self._run_test(lambda df: df.xs('county'), df.set_index(['provider']))
frames_test.py:343:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
frames_test.py:198: in _run_test
pd.testing.assert_series_equal(expected, actual)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
left = time day1
value 9
Name: county, dtype: object, right = 0 time value
time day1 NaN NaN
value 9 NaN NaN
cls = <class 'pandas.core.series.Series'>
def _check_isinstance(left, right, cls):
"""
Helper method for our assert_* methods that ensures that
the two objects being compared have the right type before
proceeding with the comparison.
Parameters
----------
left : The first object being compared.
right : The second object being compared.
cls : The class type to check against.
Raises
------
AssertionError : Either `left` or `right` is not an instance of `cls`.
"""
cls_name = cls.__name__
if not isinstance(left, cls):
raise AssertionError(
f"{cls_name} Expected type {cls}, found {type(left)} instead"
)
if not isinstance(right, cls):
> raise AssertionError(
f"{cls_name} Expected type {cls}, found {type(right)} instead"
)
E AssertionError: Series Expected type <class 'pandas.core.series.Series'>, found <class 'pandas.core.frame.DataFrame'> instead
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- [X] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
I'm seeing a similar error on a different operation (kurt with axis=None, which returns a scalar):
from apache_beam.dataframe.doctests import teststring
test = """
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]},
... index=['cat', 'dog', 'dog', 'mouse'])
>>> df
a b
cat 1 3
dog 2 4
dog 2 4
mouse 3 4
>>> df.kurt()
a 1.5
b 4.0
dtype: float64
Passes on Beam with axis=1
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [3, 4], 'd': [1, 2]},
... index=['cat', 'dog'])
>>> df.kurt(axis=1)
cat -6.0
dog -6.0
dtype: float64
Fails on Beam with axis=None
>>> df.kurt(axis=None).round(6)
-0.988693
"""
teststring(test)
AttributeError: '_DeferredScalar' object has no attribute 'round'