pandas icon indicating copy to clipboard operation
pandas copied to clipboard

ENH: Support mask in groupby sum

Open phofl opened this issue 3 years ago • 0 comments

  • [x] xref #37493 (Replace xxxx with the Github issue number)
  • [x] Tests added and passed if fixing a bug or adding a new feature
  • [x] All code checks passed.
  • [x] Added type annotations to new arguments/methods/functions.
  • [x] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

phofl avatar Aug 09 '22 21:08 phofl

Are you ok with merging this? Joris is linked for notification purposes

phofl avatar Aug 11 '22 17:08 phofl

Are you ok with merging this? Joris is linked for notification purposes

Sure, we can have a followup PR if any are needed.

mroeschke avatar Aug 11 '22 17:08 mroeschke

Not strictly specific to this PR (it's already existing behaviour), but noticed from looking at the changes here: the resulting out has the same type as the input values (both typed as the fused sum_t). However, for a plain sum (not grouped sum), we (or numpy) always output (u)int64 for any integer input dtype, in contrast to float32 and float64 which keep there precision in sum. So now that we introduce integer support for the grouped sum (in addition to float), we should maybe also consider making this consistent with Series.sum behaviour?

(I suppose right now (before this PR), after going through float in the group_sum algo, we cast back to the original dtype, so preserving the input dtype)

jorisvandenbossche avatar Aug 11 '22 18:08 jorisvandenbossche

(the same also applies to prod https://github.com/pandas-dev/pandas/pull/48027)

jorisvandenbossche avatar Aug 11 '22 18:08 jorisvandenbossche

This is indeed a change in behavior, currently we are casting int8 to float, if they are to large. This pr causes a overflow. Will fix this in a follow up. We can simply use int64 or uint64 as out dtype and try to cast back if possible afterwards

phofl avatar Aug 11 '22 18:08 phofl

Ah, yes, it's indeed a behaviour change that it can now overflow (because of not casting to float before the algo). We currently try to cast back, and that can still give an overflow. ~~We currently warn about that~~ (edit the example below is actually warning in the Series constructor because I used too large values ...):

In [8]: pd.Series([150, 150, 3, 100], dtype="int8").groupby([0, 0, 1, 1]).sum()
<ipython-input-8-03db26e15165>:1: FutureWarning: Values are too large to be losslessly cast to int8. In a future version this will raise OverflowError. To retain the old behavior, use pd.Series(values).astype(int8)
  pd.Series([150, 150, 3, 100], dtype="int8").groupby([0, 0, 1, 1]).sum()
Out[8]: 
0   -212.0
1    103.0
dtype: float64

so in addition to using (u)int64 as out dtype inside the group_sum algo, I think we should also consider to not cast back (we also don't do that for Series.sum)

jorisvandenbossche avatar Aug 11 '22 18:08 jorisvandenbossche

Just that I understand you correctly, you don’t want to cast back at all , independently of overflow issues?

don’t have a strong opinion. Just something to consider: the output of a sum might be considerably smaller than from a groupby operation, this might be important when considering the memory footprint. But I would be open to not casting back in general, just wanted to mention this

phofl avatar Aug 11 '22 19:08 phofl

Independently of this, this is also an issue for cumsum

phofl avatar Aug 11 '22 19:08 phofl

Yes, it is certainly true that because of the grouping, you might less easily run into overflow. Although with sufficiently large data / few large groups, I think in practice people can also easily get that. And it's also true that with a plain sum, you get a scalar result (for which memory typically won't matter that much), while for groupby you have a full column, for which the data type might be more important.

Now, this is an existing issue. Before this PR (+ https://github.com/pandas-dev/pandas/pull/48059), we tried to cast back the float values to original dtype (and kept float if not possible), and not we cast back the (u)int64 values to the original dtype (and keep the (u)int64 if not possible). So at least that is already an improvement, and I can still open an issue about the "try to cast back" in general.

jorisvandenbossche avatar Aug 12 '22 18:08 jorisvandenbossche

Yes agreed. We are better off now than before. As mentioned above I am not opposed to avoid casting back. But this is a bit out of scope here, so I think a new issue is a good idea.

phofl avatar Aug 12 '22 18:08 phofl