datafusion-comet
fuzz test failure: `corr` NULL vs NaN
Describe the bug
SQL
SELECT c3, c42, corr(c20, c6) FROM test0 GROUP BY c3,c42 ORDER BY c3, c42;
Spark Plan
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(3) Sort [c3#3 ASC NULLS FIRST, c42#42 ASC NULLS FIRST], true, 0
   +- AQEShuffleRead coalesced
      +- ShuffleQueryStage 1
         +- Exchange rangepartitioning(c3#3 ASC NULLS FIRST, c42#42 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=91129]
            +- *(2) HashAggregate(keys=[c3#3, c42#42], functions=[corr(c20#20, c6#6)], output=[c3#3, c42#42, corr(c20, c6)#28057])
               +- AQEShuffleRead coalesced
                  +- ShuffleQueryStage 0
                     +- Exchange hashpartitioning(c3#3, c42#42, 200), ENSURE_REQUIREMENTS, [plan_id=91101]
                        +- *(1) HashAggregate(keys=[c3#3, c42#42], functions=[partial_corr(c20#20, c6#6)], output=[c3#3, c42#42, n#28038, xAvg#28039, yAvg#28040, ck#28041, xMk#28042, yMk#28043])
                           +- *(1) ColumnarToRow
                              +- FileScan parquet [c3#3,c6#6,c20#20,c42#42] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/andy/git/apache/datafusion-comet/fuzz-testing/test0.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c3:int,c6:double,c20:double,c42:array<timestamp_ntz>>
+- == Initial Plan ==
   Sort [c3#3 ASC NULLS FIRST, c42#42 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(c3#3 ASC NULLS FIRST, c42#42 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=91083]
      +- HashAggregate(keys=[c3#3, c42#42], functions=[corr(c20#20, c6#6)], output=[c3#3, c42#42, corr(c20, c6)#28057])
         +- Exchange hashpartitioning(c3#3, c42#42, 200), ENSURE_REQUIREMENTS, [plan_id=91080]
            +- HashAggregate(keys=[c3#3, c42#42], functions=[partial_corr(c20#20, c6#6)], output=[c3#3, c42#42, n#28038, xAvg#28039, yAvg#28040, ck#28041, xMk#28042, yMk#28043])
               +- FileScan parquet [c3#3,c6#6,c20#20,c42#42] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/andy/git/apache/datafusion-comet/fuzz-testing/test0.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c3:int,c6:double,c20:double,c42:array<timestamp_ntz>>
Comet Plan
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(1) CometColumnarToRow
   +- CometSort [c3#3, c42#42, corr(c20, c6)#28174], [c3#3 ASC NULLS FIRST, c42#42 ASC NULLS FIRST]
      +- AQEShuffleRead coalesced
         +- ShuffleQueryStage 1
            +- CometColumnarExchange rangepartitioning(c3#3 ASC NULLS FIRST, c42#42 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, CometColumnarShuffle, [plan_id=91263]
               +- CometHashAggregate [c3#3, c42#42, n#28155, xAvg#28156, yAvg#28157, ck#28158, xMk#28159, yMk#28160], Final, [c3#3, c42#42], [corr(c20#20, c6#6)]
                  +- AQEShuffleRead coalesced
                     +- ShuffleQueryStage 0
                        +- CometColumnarExchange hashpartitioning(c3#3, c42#42, 200), ENSURE_REQUIREMENTS, CometColumnarShuffle, [plan_id=91217]
                           +- CometHashAggregate [c3#3, c6#6, c20#20, c42#42], Partial, [c3#3, c42#42], [partial_corr(c20#20, c6#6)]
                              +- CometScan [native_iceberg_compat] parquet [c3#3,c6#6,c20#20,c42#42] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/home/andy/git/apache/datafusion-comet/fuzz-testing/test0.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c3:int,c6:double,c20:double,c42:array<timestamp_ntz>>
+- == Initial Plan ==
   CometSort [c3#3, c42#42, corr(c20, c6)#28174], [c3#3 ASC NULLS FIRST, c42#42 ASC NULLS FIRST]
   +- CometColumnarExchange rangepartitioning(c3#3 ASC NULLS FIRST, c42#42 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, CometColumnarShuffle, [plan_id=91198]
      +- CometHashAggregate [c3#3, c42#42, n#28155, xAvg#28156, yAvg#28157, ck#28158, xMk#28159, yMk#28160], Final, [c3#3, c42#42], [corr(c20#20, c6#6)]
         +- CometColumnarExchange hashpartitioning(c3#3, c42#42, 200), ENSURE_REQUIREMENTS, CometColumnarShuffle, [plan_id=91196]
            +- CometHashAggregate [c3#3, c6#6, c20#20, c42#42], Partial, [c3#3, c42#42], [partial_corr(c20#20, c6#6)]
               +- CometScan [native_iceberg_compat] parquet [c3#3,c6#6,c20#20,c42#42] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/home/andy/git/apache/datafusion-comet/fuzz-testing/test0.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c3:int,c6:double,c20:double,c42:array<timestamp_ntz>>
First difference at row 150:
Spark: 1190973260,[3333-01-21T01:11:48.781],NULL
Comet: 1190973260,[3333-01-21T01:11:48.781],NaN
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response
I'm checking this
Repro:
test("corr - nan/null") {
  withTable("t1") {
    sql("""create table t1 using parquet as
      select cast(null as float) f1, CAST('NaN' AS float) f2, cast(null as double) d1, CAST('NaN' AS double) d2
      from range(1)
      """)
    checkSparkAnswerAndOperator(
      """
        |select
        | corr(f1, f2) c1,
        | corr(f1, f1) c2,
        | corr(f2, f1) c3,
        | corr(f2, f2) c4,
        | corr(d1, d2) c5,
        | corr(d1, d1) c6,
        | corr(d2, d1) c7,
        | corr(d2, d2) c8
        | FROM t1""".stripMargin)
  }
}
== Results ==
!== Correct Answer - 1 == == Spark Answer - 1 ==
struct<c1:double,c2:double,c3:double,c4:double,c5:double,c6:double,c7:double,c8:double> struct<c1:double,c2:double,c3:double,c4:double,c5:double,c6:double,c7:double,c8:double>
![null,null,null,null,null,null,null,null] [null,null,null,NaN,null,null,null,NaN]
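To make the mismatch easier to reason about, here is a minimal plain-Scala sketch of a streaming correlation state using the buffer names from the Spark plan (`n`, `xAvg`, `yAvg`, `ck`, `xMk`, `yMk`). It is not Spark's or Comet's code. `CorrState`, `update`, `evaluate`, and `evaluateWithoutGuard` are hypothetical names, and the sketch assumes that rows with a NULL input are skipped and that, with default settings, Spark returns NULL when a group contains exactly one non-null pair. `evaluateWithoutGuard` only illustrates how dropping that single-row check would yield NaN for the repro; it is not a claim about what DataFusion actually does.

```scala
// Hypothetical model of a corr aggregation buffer; None stands in for SQL NULL.
final case class CorrState(
    n: Double = 0.0,
    xAvg: Double = 0.0,
    yAvg: Double = 0.0,
    ck: Double = 0.0,
    xMk: Double = 0.0,
    yMk: Double = 0.0) {

  // Assumption: a row is ignored when either input is NULL.
  def update(x: Option[Double], y: Option[Double]): CorrState = (x, y) match {
    case (Some(xv), Some(yv)) =>
      val newN = n + 1.0
      val dx = xv - xAvg
      val dy = yv - yAvg
      val newXAvg = xAvg + dx / newN
      val newYAvg = yAvg + dy / newN
      CorrState(
        newN,
        newXAvg,
        newYAvg,
        ck + dx * (yv - newYAvg),
        xMk + dx * (xv - newXAvg),
        yMk + dy * (yv - newYAvg))
    case _ => this
  }

  // Assumed Spark-like result: NULL for zero rows and for a single row.
  // Simplified: does not model how a zero denominator is treated for n > 1.
  def evaluate: Option[Double] =
    if (n == 0.0 || n == 1.0) None
    else Some(ck / math.sqrt(xMk * yMk))

  // Hypothetical variant without the single-row check, for comparison only.
  def evaluateWithoutGuard: Option[Double] =
    if (n == 0.0) None
    else Some(ck / math.sqrt(xMk * yMk))
}

// Repro shape: one row, NULL in f1/d1 and NaN in f2/d2.
val nanPair = CorrState().update(Some(Double.NaN), Some(Double.NaN))
val nullPair = CorrState().update(None, Some(Double.NaN))
println(s"guarded:   ${nanPair.evaluate}, ${nullPair.evaluate}")
println(s"unguarded: ${nanPair.evaluateWithoutGuard}, ${nullPair.evaluateWithoutGuard}")
```

Under these assumptions the guarded evaluator returns NULL in every case of the repro, while the unguarded one returns NaN only where both inputs are non-null NaN values, which is the same pattern as columns `c4` and `c8` in the results above.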
https://github.com/apache/datafusion/issues/18659
The fix has been merged and will be included in the next DataFusion release. @andygrove, should we keep this ticket in 0.12.0?