pandas icon indicating copy to clipboard operation
pandas copied to clipboard

BUG: groupby.apply result doesn't have consistent shape for single vs multiple group cases

Open oldrichsmejkal opened this issue 3 years ago • 4 comments

  • [ x ] I have checked that this issue has not already been reported.

  • [ x ] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd

# Prepare data, df_multi is multiple symbols
df_multi = pd.DataFrame(
    {'time': [1,2,1,2],
     'symbol': ['a','a','b','b'],
     'value1': [10, 12, 20, 21],
     'value2': [20, 22, 40, 41]
    }
).set_index(['time', 'symbol'])

df_single = df_multi[df_multi.index.get_level_values('symbol') == 'a'].copy()


res_multi = df_multi.groupby(level = 'symbol').apply(lambda x: x['value1']) # we are applying trivial function here, but goal is here can be complex calculation
print(res_multi)

res_single = df_single.groupby(level = 'symbol').apply(lambda x: x['value1'])
print(res_single)

Problem description

res_multi and res_single have inconsistent shapes. Output in res_multi is correct, output in res_single have index level[0] in columns instead on in index. Output seems to be consistent, after calling something like (but I am not sure how general it is..): res_single.swapaxis(level = 1).unstack() So something like this is possible fix. (I believe it should be handled in pandas, because otherwise all downstream code have to handle single group case separately.)

Output of pd.show_versions()

INSTALLED VERSIONS

commit : c7f7443c1bad8262358114d5e88cd9c8a308e8aa python : 3.8.10.final.0 python-bits : 64 OS : Linux OS-release : 5.13.5-xanmod1 Version : #0~git20210725.ec5a2ac SMP PREEMPT Sun Jul 25 17:24:47 UTC 2021 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.3.1 numpy : 1.21.0 pytz : 2021.1 dateutil : 2.8.1 pip : 21.1.2 setuptools : 52.0.0.post20210125 Cython : None pytest : 6.2.4 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.6.3 html5lib : None pymysql : None psycopg2 : None jinja2 : 3.0.1 IPython : 7.24.1 pandas_datareader: 0.10.0 bs4 : None bottleneck : 1.3.2 fsspec : None fastparquet : None gcsfs : None matplotlib : 3.4.2 numexpr : 2.7.3 odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : 1.7.0 sqlalchemy : 1.4.19 tables : None tabulate : 0.8.9 xarray : None xlrd : None xlwt : None numba : 0.53.1

oldrichsmejkal avatar Jul 27 '21 14:07 oldrichsmejkal

This seems to me to be an example where apply's inference cannot succeed in all cases. In such a case, I would typically suggest using groupby.transform instead as it no longer needs to infer; however we can't here because transform fails.

In this case, the fast_path of DataFrameGroupBy._choose_path succeeds but the slowpath fails. In such a case, we could go forward with the fast_path instead of raising like we currently do. This is somewhat safe as it would only allow code that previously raised an error to now work.

However, it seems to me the fast_path / slow_path makes pandas prone to inconsistencies, depending on path chosen, and hard for users to use as what object their transformer gets depends on how it behaves. Instead, could we allow user control of this by adding an argument, e.g.:

 df_multi.groupby(level = 'symbol').transform(lambda x: x['value1'], by='group')  # "fast-path"

vs

 df_multi.groupby(level = 'symbol').transform(lambda x: x['value1'], by='column')  # "slow-path"

This would at least give control, and the user would know whether the UDF provided should accept a DataFrame or Series.

rhshadrach avatar Aug 02 '22 23:08 rhshadrach

I ran into this problem as well these days.

On the original approach with apply, I investigated in a debugger. I think the two cases take different paths after this line:

https://github.com/pandas-dev/pandas/blob/7214f857e45965fe7a0cfbba1d2dc35abb0fd7f4/pandas/core/groupby/generic.py#L1109

The variable all_indexed_same checks whether all series produced by applying the function on the groups have the same index. When there is only one group and thus only one series, this variable is True, otherwise it is False. And when all_indexed_same is True, then groupby arranges the series in rows of the output data frame.

I doubt a fix is possible without introducing an inconsistency for the use case where all_indexed_same is intentionally True. Maybe one could check whether the single series produced on the one group has the same index as the original data frame and then continue as for not all_indexed_same. But then there could be edge cases where this is the case by accident, which would introduce another inconsistency...

On the approach of using transform, my understanding was that transform expects a function that does not modify the shape or is broadcastable to the original shape of the group's data frame. In this issue's case, the function takes a data frame and returns a series. In this case, would one then have to switch off broadcasting?

bherwerth avatar Aug 05 '22 19:08 bherwerth

my understanding was that transform expects a function that does not modify the shape or is broadcastable to the original shape of the group's data frame...

This is currently true. Also, the function must be able to act column-by-column. What I'm discussing is removing these restrictions. To me, even something like the following makes sense as a transform, and currently works with the fast-path (have to also disable reindexing the columns of the result):

def foo(x):
    return pd.concat([x['b'], x['b'].rename('y') + 2, x['c']], axis=1)

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [7, 8, 9]})
print(df.groupby('a').transform(foo))

#    b  y  c
# 0  3  5  7
# 1  4  6  8
# 2  5  7  9

In this case, would one then have to switch off broadcasting?

I don't follow why this would need to be done - can you expand on this? The key feature needed about the result from the supplied UDF is that it needs to be able to be either broadcasted or aligned to the input group's index. However thought needs to be put into the logic of whether to broadcast or align.

rhshadrach avatar Aug 06 '22 13:08 rhshadrach

What you describe seems like a useful extension of transform to me.

I don't follow why this would need to be done - can you expand on this? The key feature needed about the result from the supplied UDF is that it needs to be able to be either broadcasted or aligned to the input group's index.

Broadcasting in the sense of aligning to the group's index totally makes sense to me. I was more thinking in the logic of the current implementation, which requires to also preserve the columns: If the group is a data frame of shape (m, n), the UDF returns a series of length m, then the series could be coerced to the original shape by interpreting it as shape (m, 1) and broadcasting to (m, n).

However, with the extension that you describe, only the index would need to be preserved and it would be okay to have different columns. Then my comment about broadcasting would not apply.

bherwerth avatar Aug 07 '22 14:08 bherwerth