spark-rapids icon indicating copy to clipboard operation
spark-rapids copied to clipboard

[BUG] Don't use `.min()` and `.max()` methods on floating-point columns if there's `NaN`

Open ttnghia opened this issue 3 years ago • 10 comments

The methods min and max operated on floating-point columns do not output the results that Spark wants. In particular, if there are NaNs in the input, the output of these methods may be different from the Spark's desired output.

As such, when we need to compute min/max values, do not use these min and max methods on floating-point data types unless we know that there's no NaN.

ttnghia avatar Sep 27 '22 19:09 ttnghia

@ttnghia are there specific areas where min and max are being used, or are there test cases that are missing this?

sameerz avatar Sep 27 '22 20:09 sameerz

Not now, but we just hit it in a test for https://github.com/NVIDIA/spark-rapids/issues/6130.

I file this issue so we can be more aware in the future of this case.

ttnghia avatar Sep 27 '22 20:09 ttnghia

So do you want us to document this somewhere? I am just trying to understand what the result of this is. If it is more awareness, then okay we can close this and tell everyone not to do it. If you think there might be bugs in the existing code, then we can go through the code and verify that they are all covered.

revans2 avatar Sep 27 '22 20:09 revans2

So do you want us to document this somewhere?

Yes, I think so. We can call this a bug since we can get wrong results (compared to Spark's results) with min and max on floating point columns, but I don't know how to fix it because these min and max are from cudf JNI, not from our plugin code. If we are allowed to fix it there then that's great. Otherwise, we should document this clearly.

ttnghia avatar Sep 27 '22 20:09 ttnghia

Is the bug related to GpuMax or GpuMin? I guess we have added a workaround way to match the Spark's behavior: https://github.com/NVIDIA/spark-rapids/pull/5989

HaoYang670 avatar Sep 29 '22 11:09 HaoYang670

These are ColumnVector.min() and ColumnVector.max() which are Java functions from cudf, not the plugin code. These functions are only called internally, not the ones that SQL users execute.

ttnghia avatar Sep 29 '22 13:09 ttnghia

Should we move the NaN handling (what we have done in GpuMax and GpuMin) into cuDF or rapids-JNI ?

HaoYang670 avatar Sep 29 '22 13:09 HaoYang670

That is also part of my question above. Moving the implementation to cudf JNI will make GpuMin/Max and min/max consistent but cudf JNI may be used somewhere else not just our plugin thus I'm not sure.

ttnghia avatar Sep 29 '22 13:09 ttnghia

Not familiar with JNI, but I wonder could we move the implementation to https://github.com/NVIDIA/spark-rapids-jni ?

HaoYang670 avatar Sep 29 '22 13:09 HaoYang670

Not familiar with JNI, but I wonder could we move the implementation to https://github.com/NVIDIA/spark-rapids-jni ?

Min/Max for a float/double reduction is something we could look into, but a group by aggregation or a window operation is more difficult. There is a lot more of a framework around them, and for group by we need the results to appear on the correct lines afterwards. If one operation is doing a hash based aggregation and the other is doing a sort based one it gets to be really complicated to make them all play nicely together. If CUDF could provide a way to plug in our own aggregations to group by or window afterwards, then it would be very interesting.

revans2 avatar Sep 30 '22 17:09 revans2