cudf
cudf copied to clipboard
[BUG] Pymod operation is missing `null` values
Describe the bug
When there is 0
in denominator of pymod
operation, we seem to be returning garbage value instead of null
value.
Steps/Code to reproduce bug
Type "help", "copyright", "credits" or "license" for more information.
>>> 1 % 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: integer division or modulo by zero
>>> import pandas as pd
>>> x = pd.Series([1, 2, 3, 4, 0])
>>> x % 2
0 1
1 0
2 1
3 0
4 0
dtype: int64
>>> 2 % x
0 0.0
1 0.0
2 2.0
3 2.0
4 NaN
dtype: float64
>>> import cudf
>>> x = cudf.from_pandas(x)
>>> x % 2
0 1
1 0
2 1
3 0
4 0
dtype: int64
>>> 2 % x
0 0
1 0
2 2
3 2
4 4294967295 #<- This could either be `nan` or `null`
dtype: int64
Expected behavior
Return either nan
or null
inplace of garbage values
Environment overview (please complete the following information)
- Environment location: Docker
- Method of cuDF install: from source (0.15)
Additional context This kind of looks similar to https://github.com/rapidsai/cudf/issues/5722 but not sure if they share the same code-flows, so Just putting it out there.
I guess it makes sense for a modulo zero operation to return null since it's undefined behavior but this is somewhere we would differ from Pandas.
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
cc @brandon-b-miller for more binaryop null things
Binary operations in libcudf are designed so that nulls always propagate. As a result, although we have created a specialized mod operator PYMOD
in libcudf for compatibility with Python's rules for handling negatives, we can't trivially exploit that for null handling since the operator doesn't return a nullable value.
However, in this instance there is some precedent for solving this at the libcudf level. Specifically, all the Null*
operators have special handling to bypass null propagation in the wrapper for ops. While I would normally say that we should solve this issue by post-processing the result in Python (similar to how #11441 solves #7389), the existing infrastructure for solving this problem makes me less wary of introducing this logic into C++.
@jrhemstad I'd be curious what you think about this. The tl;dr is that in pandas x % 0
returns nulls, and we're trying to figure out how best to do that. Options are (1) using the same wrapping logic that we currently use in libcudf for the Null*
operators, which I find slightly icky (but only slightly since we're already doing this, it's just a little less obvious why you'd do this for an operator that isn't strictly about specialized null handling) or (2) just post-processing the result in Python based on whether the input values contained zeros.
CC @brandon-b-miller
although we have created a specialized mod operator PYMOD in libcudf for compatibility with Python's rules for handling negatives
"PYMOD" is a misnomer. It wasn't intended to be 100% compliant with Python's %
. It was intended to just be the mathematical modulo operation vs C++'s %
is the remainder operation.
See: https://stackoverflow.com/questions/13683563/whats-the-difference-between-mod-and-remainder
I wanted to rename MOD
to REMAINDER
and PYMOD
would be MODULO
.
While I would normally say that we should solve this issue by post-processing the result in Python
This is the correct answer. It's no different than https://github.com/rapidsai/cudf/issues/7389#issuecomment-1194572293 imo.
Found the original conversation: https://github.com/rapidsai/cudf/pull/1985#pullrequestreview-248424619
What's worse is calling an operation PyMod. It doesn't tell a user of the C++ interface what it actually means. In general, we should use descriptions that stand on their own and do not resort to knowledge of another language or library.
I still agree with my past self :)
I'm fine with that explanation, although to @harrism's point we'd definitely need to document how we define remainder vs modulo since it's at best it's a nonstandard but common distinction. So perhaps we should do that rename while also addressing this issue here :)
I think resolving this issue then requires two things:
- Rename
PYMOD
toREMAINDER
(along with suitable documentation of the meaning) - Add the same kind of logic as #11441 except for modulo.
@brandon-b-miller would be up for handling this? I'm hoping it would be a near carbon copy of #11441 except with a different fill value. I can take on 1 myself if you don't want to.
happy to handle making the change 👍