bug: hash expression is not consistent with Spark

Open andygrove opened this issue 9 months ago • 0 comments

Describe the bug

Our hash implementation does not produce the same results as Spark for some inputs.

I added this test to CometCastSuite because that is where our random data generators currently live (we should move them into a common class that more test suites can use).

  test("hash") {
    val input = generateStrings(timestampPattern, 8).toDF("a")
    withTempPath { dir =>
      val data = roundtripParquet(input, dir).coalesce(1)
      data.createOrReplaceTempView("t")
      val df = spark.sql("select a, hash(a) from t order by a")
      checkSparkAnswerAndOperator(df)
    }
  }

Example output (left column: expected answer from Spark; right column: answer with Comet enabled):

!== Correct Answer - 1000 ==    == Spark Answer - 1000 ==
 struct<a:string,hash(a):int>   struct<a:string,hash(a):int>
![,142593372]                   [,0]
![	099,-1611881412]             [	099,-881749019]
![	1 474,240523873]             [	1 474,-1111423867]
![	12852,-1057581169]           [	12852,-404859411]
![	18,-492750382]               [	18,1333608017]
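
For anyone debugging this, here is a minimal standalone sketch of the Murmur3 variant that, as far as I understand it, Spark's `hash` uses for strings (seed 42, with trailing bytes mixed in one at a time rather than packed as in standard Murmur3). It is written from my reading of Spark's `Murmur3_x86_32` and is not taken from the Comet code; the names `SparkMurmur3Sketch`, `hashBytes`, and `hashUtf8` are hypothetical.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical helper for comparing values by hand; not part of Spark or Comet.
object SparkMurmur3Sketch {
  private val C1 = 0xcc9e2d51
  private val C2 = 0x1b873593

  private def mixK1(k: Int): Int =
    Integer.rotateLeft(k * C1, 15) * C2

  private def mixH1(h: Int, k1: Int): Int =
    Integer.rotateLeft(h ^ k1, 13) * 5 + 0xe6546b64

  // Final avalanche step; the length is the input length in bytes.
  private def fmix(h: Int, length: Int): Int = {
    var x = h ^ length
    x ^= x >>> 16
    x *= 0x85ebca6b
    x ^= x >>> 13
    x *= 0xc2b2ae35
    x ^= x >>> 16
    x
  }

  // Assumption: the default seed of 42 matches Spark's hash() expression.
  def hashBytes(bytes: Array[Byte], seed: Int = 42): Int = {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val aligned = bytes.length - bytes.length % 4
    var h1 = seed
    var i = 0
    // 4-byte blocks are read as little-endian ints.
    while (i < aligned) {
      h1 = mixH1(h1, mixK1(buf.getInt(i)))
      i += 4
    }
    // Each remaining byte is sign-extended to an int and mixed separately.
    while (i < bytes.length) {
      h1 = mixH1(h1, mixK1(bytes(i).toInt))
      i += 1
    }
    fmix(h1, bytes.length)
  }

  def hashUtf8(s: String, seed: Int = 42): Int =
    hashBytes(s.getBytes(StandardCharsets.UTF_8), seed)
}
```

With this sketch, `SparkMurmur3Sketch.hashUtf8("")` evaluates to 142593372, which matches the expected value for the empty string in the first row of the output above.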

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

andygrove, May 14 '24 15:05