
Comet sort order different to Spark for 0.0 and -0.0

Open andygrove opened this issue 1 year ago • 5 comments

Describe the bug

During testing with CAST, we execute queries with ORDER BY a, and it seems the results are ordered differently for 0.0 vs -0.0 with float and double.

![0.0,0.0]                                [-0.0,-0.0]
![-0.0,-0.0]                              [0.0,0.0]

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

andygrove avatar Apr 29 '24 23:04 andygrove

Scala/Java seem to take the sign into account during sorting and Rust does not. Perhaps we just need to document this as an edge case in the compatibility guide.

Scala

val x: Seq[Float] = Seq(0.0f, -0.0f, 0.0f, -0.0f, 1.0f, -1.0f)
println(x.sorted)

Output: List(-1.0, -0.0, -0.0, 0.0, 0.0, 1.0)

Rust

let mut v = vec![0.0_f32, -0.0_f32, 0.0_f32, -0.0_f32, 1.0_f32, -1.0_f32];
v.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Greater));
println!("{:?}", v);

Output: [-1.0, 0.0, -0.0, 0.0, -0.0, 1.0]
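For comparison, here is a minimal sketch of how the Rust side could reproduce the Scala ordering, assuming Rust 1.62 or later, where `f32::total_cmp` is available. The helper name `sort_with_total_order` is made up for illustration and is not part of Comet or DataFusion; whether Comet should sort this way is a separate question.

```rust
/// Hypothetical helper (not Comet code): sorts using the IEEE 754 totalOrder
/// predicate, which places -0.0 before +0.0 instead of treating them as equal.
fn sort_with_total_order(mut values: Vec<f32>) -> Vec<f32> {
    values.sort_by(|a, b| a.total_cmp(b));
    values
}

fn main() {
    let v = vec![0.0_f32, -0.0, 0.0, -0.0, 1.0, -1.0];
    println!("{:?}", sort_with_total_order(v));
    // prints [-1.0, -0.0, -0.0, 0.0, 0.0, 1.0]
}
```

With `partial_cmp`, the two zeros compare as equal and the stable sort simply keeps them in input order, which is why the original snippet leaves them interleaved.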

andygrove avatar Apr 30 '24 16:04 andygrove

We should also test with NaN in sorting

andygrove avatar May 01 '24 13:05 andygrove

TIL: IEEE 754 says that +0.0 and -0.0 are equal. In C and Java, +0.0 == -0.0 is true, but Java's Double.equals does not treat the two as equal. https://en.wikipedia.org/wiki/Signed_zero#Comparisons

I don't think SQL distinguishes between the two so I would not consider this an issue that needs to be fixed.
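The same split between numeric equality and bitwise identity can be checked in Rust. This is only a sketch: `same_bits` is a hypothetical helper that compares raw bit patterns, which is roughly comparable to what Java's Double.equals does (Java additionally canonicalizes NaN values, which this sketch ignores).

```rust
/// Hypothetical helper: true only when both doubles have identical bit patterns.
fn same_bits(a: f64, b: f64) -> bool {
    a.to_bits() == b.to_bits()
}

fn main() {
    let (pos, neg) = (0.0_f64, -0.0_f64);

    // IEEE 754 comparison: the two zeros are equal.
    println!("{}", pos == neg); // prints true
    println!("{:?}", pos.partial_cmp(&neg)); // prints Some(Equal)

    // Bitwise view: the sign bit differs, so the values are distinguishable.
    println!("{}", same_bits(pos, neg)); // prints false
    println!("{:?}", neg.total_cmp(&pos)); // prints Less
}
```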

parthchandra avatar May 02 '24 22:05 parthchandra

> We should also test with NaN in sorting

Also +infinity and -infinity while we are at it.
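A standalone sketch of what such a test input could look like on the Rust side, assuming the IEEE 754 total order (`f64::total_cmp`) as the comparator. `sort_doubles` is a hypothetical helper, not Comet's sort implementation, and the positive sign is forced on the NaN because the total order places negative NaNs first and positive NaNs last.

```rust
/// Hypothetical helper: ascending sort by IEEE 754 total order.
fn sort_doubles(mut values: Vec<f64>) -> Vec<f64> {
    values.sort_by(|a, b| a.total_cmp(b));
    values
}

fn main() {
    // Force a positive sign bit so the NaN sorts after +infinity.
    let nan = f64::NAN.copysign(1.0);
    let input = vec![nan, f64::INFINITY, 0.0, -0.0, f64::NEG_INFINITY, 1.0, -1.0];
    println!("{:?}", sort_doubles(input));
    // prints [-inf, -1.0, -0.0, 0.0, 1.0, inf, NaN]
}
```

Spark's documented NaN semantics place NaN last in ascending order, so for positive NaNs this ordering would be expected to line up with Spark; an actual test would still need to compare against Spark's output.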

parthchandra avatar May 02 '24 23:05 parthchandra

It is probably worth having a section in the compatibility guide specifically for Rust vs Java differences like this.

andygrove avatar May 04 '24 12:05 andygrove

I found that Spark's ORDER BY is not a stable sort, as indicated by https://issues.apache.org/jira/browse/SPARK-45243. DataFusion's sort is not stable either: https://github.com/apache/datafusion/blob/9b4f90ad1eefabdc0d5bbbfd99e58765b041bb77/datafusion/physical-plan/src/sorts/sort.rs#L601. The SQL standard does not seem to guarantee stability: https://stackoverflow.com/questions/15522746/is-sql-order-by-clause-guaranteed-to-be-stable-by-standards

I sometimes see Spark and DataFusion produce the same sort order for 0.0 and -0.0. I agree that I would not worry about it.
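One way to see why the results can sometimes agree: when the comparator treats 0.0 and -0.0 as equal, their relative position in the output is decided by the input order (for a stable sort) or by the algorithm's internals (for an unstable one), not by the values. A minimal Rust sketch of the stable case follows; `sort_ieee` is a hypothetical helper and does not reflect how either engine actually sorts.

```rust
/// Hypothetical helper: stable sort with a comparator that treats 0.0 and -0.0
/// as equal. Panics on NaN, which this sketch does not handle.
fn sort_ieee(mut values: Vec<f64>) -> Vec<f64> {
    values.sort_by(|a, b| a.partial_cmp(b).expect("NaN not supported in this sketch"));
    values
}

fn main() {
    // Same values, different input order: the zeros keep their input order.
    println!("{:?}", sort_ieee(vec![1.0, 0.0, -0.0])); // prints [0.0, -0.0, 1.0]
    println!("{:?}", sort_ieee(vec![1.0, -0.0, 0.0])); // prints [-0.0, 0.0, 1.0]
}
```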

kazuyukitanimura avatar Sep 27 '24 08:09 kazuyukitanimura

It is working fine for NaN, +infinity, and -infinity.

The only thing is that Spark seems to always sort 0.0 before -0.0, which does not make sense. Perhaps this is a Spark issue.

kazuyukitanimura avatar Sep 27 '24 08:09 kazuyukitanimura