cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[BUG] Floor division behavior of integer by series with zeroes differs from Pandas

Open brandon-b-miller opened this issue 4 years ago • 10 comments

Describe the bug Dividing an int by a series or dataframe with zeroes in it yields a different result from Pandas nullable dtypes.

Steps/Code to reproduce bug

import pandas as pd
1 // pd.Series([0], dtype='Int64') # Nullable
0    0
dtype: Int64

vs

>>> 1 // cudf.Series([0])
0    9223372036854775807

Expected behavior I'm not sure which result is preferred here. We could cast to float but that'd require a scan.

Environment overview (please complete the following information)

  • Environment location: Bare Metal
  • Method of cuDF install: Source

brandon-b-miller avatar Feb 16 '21 17:02 brandon-b-miller

So Pandas expects division by zero to yield zero?

jrhemstad avatar Feb 16 '21 17:02 jrhemstad

Not sure if it's a bug, something that just hasn't been implemented yet, or otherwise. Raised https://github.com/pandas-dev/pandas/issues/39847 to ask. In our case maybe we want to raise.

brandon-b-miller avatar Feb 16 '21 19:02 brandon-b-miller

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Mar 18 '21 22:03 github-actions[bot]

Based on discussion in #7492, we initially decided that the appropriate solution would be to match pandas by setting division by zero entries to zero in the output, but in subsequent discussions we decided to take no action. @brandon-b-miller at this point do you think it's worth revisiting this and doing the post facto 0 filling, or should we just close this issue?

vyasr avatar Jul 22 '22 22:07 vyasr

As long as we're doing things like INT_POW could there be precedent to add this at the libcudf level?

brandon-b-miller avatar Jul 25 '22 18:07 brandon-b-miller

As long as we're doing things like INT_POW could there be precedent to add this at the libcudf level?

INT_POW is different. It's a distinct algorithm for doing the "pow" operation.

"INT_DIVISION_BY_0_YIELDS_0" is far more niche.

If this behavior is desired, it should be a pre/post-processing step.

jrhemstad avatar Jul 25 '22 20:07 jrhemstad

I agree, I'd still want to do this as a post-processing step. I think we'd probably be OK to eat the perf hit though.

vyasr avatar Jul 26 '22 00:07 vyasr

So we're OK with doing a filter and I guess a scatter every time we do floor division by a cuDF series?

brandon-b-miller avatar Aug 01 '22 19:08 brandon-b-miller

Yeah I think so. This feels like a very typical case where cuDF Python should pay a performance cost for matching pandas behavior exactly. No need to overspecialize libcudf.

vyasr avatar Aug 01 '22 22:08 vyasr

This issue is pretty similar to #5938 in spirit, although it'll require a pretty different type of fix.

vyasr avatar Aug 02 '22 00:08 vyasr

This appears to be a moving target on the pandas side

>>> import pandas as pd
>>> pd.__version__
'1.4.3'
>>> 1 // pd.Series([0], dtype='Int64')
0    0
dtype: Int64

vs

In [1]: pd.__version__
Out[1]: '1.6.0.dev0'  # essentially 1.5

In [2]: 1 // pd.Series([0], dtype='Int64')
Out[2]:
0    inf
dtype: Float64

FWIW I do find the 0 result from pandas a bit odd. And given discussion on the pandas side related to this (https://github.com/pandas-dev/pandas/issues/30188 and https://github.com/pandas-dev/pandas/issues/32265) maybe in this case it may make sense for cuDF to return a more sensible result (e.g. not 0)?

mroeschke avatar Aug 23 '22 20:08 mroeschke

At least as of #12074, cudf will match pandas 1.5 AFAICT in these cases.

wence- avatar Nov 08 '22 18:11 wence-