dask Groupby sort upstream compatibility

Fixes #9428, sort of. The pandas GroupBy defaults to sort=True, and None is not a valid input. Dask's implementation has sort=None as a default, so when we started to pass self.sort down, that got truthied to False. So #9302 inadvertently switched sorting of intermediate partitions from True to False, which was not obvious at the time! In the interests of upstream compatibility, this PR changes the default for sort to True. I've also updated the defaults for dropna and observed while I'm here to match upstream defaults.

That being said, we probably don't want to sort intermediate partitions when doing a groupby/agg, since that is extra work. So this PR also reintroduces the idea of #9386, but doing it deliberately this time :)

Some justification, based on some quick pandas performance measurements:

import dask.datasets

# Create a pandas dataframe with a lot of groups, dask is just used for expediency
df = dask.datasets.timeseries(id_lam=1e18).compute()

# Groupby/agg on integer column
%time df.groupby("id", sort=True).agg({"x": "sum", "y": "mean"})   # 1.7 s
%time df.groupby("id", sort=False).agg({"x": "sum", "y": "mean"})  # 500 ms

# Groupby/agg on string[python] column
df2 = df.assign(id=df.id.astype("string[python]"))

%time df2.groupby("id", sort=True).agg({"x": "sum", "y": "mean"})  # 9.4 s
%time df2.groupby("id", sort=False).agg({"x": "sum", "y": "mean"})  # 1.6 s

# Groupby/agg on string[pyarrow] column
df3 = df.assign(id=df.id.astype("string[pyarrow]"))

%time df3.groupby("id", sort=True).agg({"x": "sum", "y": "mean"})  # 7.7 s
%time df3.groupby("id", sort=False).agg({"x": "sum", "y": "mean"})  # 1.2 s

# Groupby/agg on integer categorical column
df4 = df.assign(id=df.id.astype("category"))

%time df4.groupby("id", sort=True).agg({"x": "sum", "y": "mean"})  # 700 ms
%time df4.groupby("id", sort=False).agg({"x": "sum", "y": "mean"})  # 1.4 s

# Groupby/agg on string categorical column
df5 = df.assign(id=df.id.astype("string[python]").astype("category"))

%time df5.groupby("id", sort=True).agg({"x": "sum", "y": "mean"})  # 1.6 s
%time df5.groupby("id", sort=False).agg({"x": "sum", "y": "mean"})  # 10.3 s

Note that in almost all cases, it's significantly faster to set sort=False. The major exception is with categoricals, which, bizarrely, are much faster with sort=True! This seems to be an upstream performance issue in pandas, possibly related to https://github.com/pandas-dev/pandas/issues/32976, where a cythonized fast-path is being missed.

I don't think we should be targeting categoricals specifically here, especially since it's likely not a fundamental performance gap, so I'm more comfortable just saying that generally sort=False for intermediate partitions is a better option here. I mostly bring up categoricals because that is the case that was being caught in #9428. So even though this PR won't fix that specific regression, I feel that this is the more correct and maintainable thing to do here.

Sep 13 '22 22:09 ian-r-rose

Also, cc @mroeschke, who might be interested in the weird performance characteristics around groupby/agg on categorical columns.

Sep 13 '22 22:09 ian-r-rose

Hmm, seems as if this is a breaking change for people who are using split_out > 1. Perhaps the correct thing to do here is to raise a deprecation warning if someone has sort=None and split_out > 1, then make the change in a couple of releases.

I still think it is better to be explicit about sort=False for intermediate partitions here, rather than relying on the incorrect-but-works falsiness of None.

Sep 13 '22 23:09 ian-r-rose

Okay, I've backed out the top-level API change here and replaced it with a FutureWarning. After a couple of releases, I'd say it would be okay to do the following:

Change DataFrame.groupby(sort=True) to the default to match pandas
Remove the FutureWarnings from dask/dataframe/groupby.py

Sep 16 '22 18:09 ian-r-rose

Thanks @ian-r-rose !

Sep 22 '22 18:09 rjzamora

Thanks for the review and discussion @rjzamora!

Sep 22 '22 18:09 ian-r-rose

Also, cc @mroeschke, who might be interested in the weird performance characteristics around groupby/agg on categorical columns.

Finally got a change to look at this behavior a bit.

The performance bottleneck is actually due to categorical reordering when sort=False. In the initial DataFrame, the unique categories are (string) ordered, but for groupby("cat_id", sort=False), the resulting categories need to be re-ordered based on how they appear sequentially in the grouping column. I don't think we have any special string dtype ops for this operation, so I wouldn't be surprised if this is essentially treated as object type.

Also, there has been discussion to change the default sort=True in groupby due to the general performance hit.

Sep 23 '22 16:09 mroeschke

Thanks for the info @mroeschke.

In the initial DataFrame, the unique categories are (string) ordered, but for groupby("cat_id", sort=False), the resulting categories need to be re-ordered based on how they appear sequentially in the grouping column.

I don't quite understand this -- why do the resulting categories need to be re-ordered if sort=False? That seems like it's counter to the intent.

Also, there has been discussion to change the default sort=True in groupby due to the general performance hit.

Do you have a reference to this discussion? It's my impression that @rjzamora would be very much in favor of turning off this default :)

Sep 23 '22 16:09 ian-r-rose

why do the resulting categories need to be re-ordered if sort=False?

Appears to go back to https://github.com/pandas-dev/pandas/pull/9480 and https://github.com/pandas-dev/pandas/issues/8868. It makes sense that the values shouldn't be sorted, but I'm not sure why the resulting categories also need to be unsorted. It may have been an unintended consequence of the bug fix where sort=False wasn't working with categoricals

Do you have a reference to this discussion?

I don't think there's a public discussion about this yet, but I think this has been brought up by potential sponsors who were interested in this change. I'll let you know if it becomes a public effort.

Sep 23 '22 17:09 mroeschke

dask dask copied to clipboard

Groupby sort upstream compatibility

dask
dask copied to clipboard