cudf [BUG] sort_lists is not consistent for float and double ordering

Describe the bug I was kind of caught by surprise at the comment left on a bug we are working through in the Spark plugin.

https://github.com/NVIDIA/spark-rapids/pull/6451#issuecomment-1232144326

The heuristic is in the code.

https://github.com/rapidsai/cudf/blob/9b3fdafc8d888adcebd645cd62b0473d9cd1dc34/cpp/src/lists/segmented_sort.cu#L166

And I have a test that shows that sorting negative NaN values does the wrong thing if it the average size of the list is over 100 entries.

Steps/Code to reproduce bug

val nan_1 = java.lang.Double.longBitsToDouble(0x7ff83cec2c05b870L)
val nan_2 = java.lang.Double.longBitsToDouble(0xfff5101d3f1cd31bL)
val data = (Seq(-0.0, nan_1, nan_2, Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity) ++ (50.until(0, -1)).map(_.toDouble) ++ Seq(Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity, nan_1, nan_2, -0.0)).toArray
System.err.println(Seq(data).toDF.repartition(1).selectExpr("sort_array(value)").collect().toList.toString)
// Prints out a result with NaNs all at the end 
val data = (Seq(0.0, -0.0, nan_1, nan_2, Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity) ++ (500.until(0, -1)).map(_.toDouble) ++ Seq(Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity, nan_1, nan_2, -0.0, 0.0)).toArray
System.err.println(Seq(data).toDF.repartition(1).selectExpr("sort_array(value)").collect().toList.toString)
// Prints out a list with some NaNs at the beginning of the list

I can write a C++ test if I need to.

Expected behavior Ideally the sort would be consistent no matter how many values at in the list columns.

Additional context This is not a super high priority. -NaN is a very uncommon value in practice and sorting a list of over 100 values on average is also not super common. But it would be nice to have it fixed at some point in the future.

Aug 31 '22 14:08 revans2

I was able to recreate this. The problem seems to be with negative NaN values. The divergence in the code mentioned in the description will either us the internal segmented-sort logic (when average list size is < 100) which does not differentiate between +NaN or -NaN and sorts all these to the end. For average list size >= 100, the logic instead calls the cub::DeviceSegmentedRadixSort which sorts on bit values and thereby puts -NaN to the front and +NaN to the end. This is the reason for the discrepancy in the output. I believe the desired output is to place all the NaN values at the end, right?

Sep 13 '22 23:09 davidwendt

I believe the desired output is to place all the NaN values at the end, right?

That is correct for my understanding too.

Sep 19 '22 15:09 revans2

I believe the desired output is to place all the NaN values at the end, right?

That is correct for my understanding too.

Ok. That matches what I'm seeing in other places here in libcudf too. I created PR #11703 to fix this.

Sep 19 '22 16:09 davidwendt

cudf cudf copied to clipboard

[BUG] sort_lists is not consistent for float and double ordering

cudf
cudf copied to clipboard