spark-rapids
[BUG] Don't use `.min()` and `.max()` methods on floating-point columns if there's `NaN`
The `min` and `max` methods, when run on floating-point columns, do not produce the results that Spark expects. In particular, if the input contains NaNs, the output of these methods may differ from Spark's desired output.
As such, when we need to compute min/max values, we should not use these `min` and `max` methods on floating-point data types unless we know there is no NaN.
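To make the mismatch concrete, here is a minimal plain-Java sketch (a stand-in for the cudf reduction, not plugin or cudf code): Spark orders NaN as greater than every other floating-point value, so `min([1.0, NaN]) == 1.0`, while an IEEE-style reduction such as `Math.min` propagates NaN.

```java
// Sketch of the semantic gap between Spark's NaN ordering and an
// IEEE-style NaN-propagating reduction. Plain-Java illustration only;
// the actual issue is in cudf's ColumnVector.min()/max() reductions.
public class NaNMinDemo {
    // Spark semantics: NaN sorts greater than every other double,
    // so min([1.0, NaN]) == 1.0. Double.compare uses the same total
    // order (NaN last), so it matches Spark here.
    static double sparkMin(double[] xs) {
        double m = xs[0];
        for (double x : xs) {
            if (Double.compare(x, m) < 0) m = x; // NaN never compares smaller
        }
        return m;
    }

    // Naive IEEE-style reduction: Math.min returns NaN if either
    // argument is NaN, so one NaN poisons the whole result.
    static double naiveMin(double[] xs) {
        double m = xs[0];
        for (double x : xs) m = Math.min(m, x);
        return m;
    }

    public static void main(String[] args) {
        double[] xs = {1.0, Double.NaN, 3.0};
        System.out.println("sparkMin = " + sparkMin(xs)); // prints 1.0
        System.out.println("naiveMin = " + naiveMin(xs)); // prints NaN
    }
}
```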
@ttnghia are there specific areas where min and max are being used, or are there test cases that are missing this?
Not now, but we just hit it in a test for https://github.com/NVIDIA/spark-rapids/issues/6130.
I filed this issue so we can be more aware of this case in the future.
So do you want us to document this somewhere? I am just trying to understand what the outcome of this issue should be. If it is just awareness, then okay, we can close this and tell everyone not to do it. If you think there might be bugs in the existing code, then we can go through the code and verify that all call sites are covered.
> So do you want us to document this somewhere?
Yes, I think so. We can call this a bug, since we can get wrong results (compared to Spark's) from `min` and `max` on floating-point columns, but I don't know how to fix it, because these `min` and `max` come from cudf JNI, not from our plugin code. If we are allowed to fix it there, that would be great. Otherwise, we should document this clearly.
Is the bug related to GpuMax or GpuMin? I believe we already added a workaround to match Spark's behavior: https://github.com/NVIDIA/spark-rapids/pull/5989
These are `ColumnVector.min()` and `ColumnVector.max()`, which are Java functions from cudf, not plugin code. These functions are only called internally; they are not the ones that SQL users execute.
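The advice in this thread amounts to guarding every internal `min()`/`max()` call on floating-point data with a NaN check. A minimal sketch of that guard pattern, with `double[]` standing in for a cudf column and all names illustrative rather than cudf API:

```java
// Hypothetical guard illustrating the rule discussed above: only fall
// back to a plain min()/max() reduction when the data has no NaN;
// otherwise compute the Spark-compliant result directly.
public class SafeMinMax {
    static boolean hasNan(double[] xs) {
        for (double x : xs) if (Double.isNaN(x)) return true;
        return false;
    }

    // Spark semantics: NaN is the largest double, so max is NaN as soon
    // as any NaN is present.
    static double sparkMax(double[] xs) {
        if (hasNan(xs)) return Double.NaN;   // NaN dominates max
        double m = xs[0];
        for (double x : xs) m = Math.max(m, x); // safe: no NaN remains
        return m;
    }

    // Spark semantics: NaN never wins min, so skip NaNs; an all-NaN
    // input still yields NaN.
    static double sparkMin(double[] xs) {
        double m = Double.POSITIVE_INFINITY;
        boolean seen = false;
        for (double x : xs) {
            if (Double.isNaN(x)) continue;
            m = Math.min(m, x);
            seen = true;
        }
        return seen ? m : Double.NaN;
    }
}
```

In the real plugin the same idea would use column-level operations (e.g. an "is NaN, any" style check before the reduction) rather than a per-element loop.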
Should we move the NaN handling (what we have done in GpuMax and GpuMin) into cuDF or rapids-JNI?
That is also part of my question above. Moving the implementation into cudf JNI would make GpuMin/GpuMax and `min`/`max` consistent, but cudf JNI may be used elsewhere, not just in our plugin, so I'm not sure.
Not familiar with JNI, but I wonder could we move the implementation to https://github.com/NVIDIA/spark-rapids-jni ?
> Not familiar with JNI, but I wonder could we move the implementation to https://github.com/NVIDIA/spark-rapids-jni ?
Min/max as a float/double reduction is something we could look into, but a group-by aggregation or a window operation is more difficult. There is a lot more of a framework around them, and for group-by we need the results to land on the correct rows afterwards. If one operation is doing a hash-based aggregation and another a sort-based one, it gets really complicated to make them all play nicely together. If cudf could provide a way to plug our own aggregations into group-by or window, then it would be very interesting.