datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

feat: Add xxhash64 function support

Open advancedxy opened this issue 1 year ago • 11 comments

Which issue does this PR close?

Part of #205 Closes #344

Rationale for this change

More function coverage

What changes are included in this PR?

  1. include twox-hash as a dep in rust
  2. implement xxhash64 related method in rust side
  3. glue code to bridge the jvm and rust

How are these changes tested?

New added test.

advancedxy avatar May 14 '24 13:05 advancedxy

Thanks @advancedxy. I plan on reviewing this PR today.

Could you also update docs/source/user-guide/expressions.md to add xxhash64 as a supported expression?

andygrove avatar May 14 '24 13:05 andygrove

Could you also update docs/source/user-guide/expressions.md to add xxhash64 as a supported expression?

Of course, I will update that among other things: such as the review comments and the inspection file: spark/inspections/CometTPCDSQueriesList-results.txt in a followup commit.

advancedxy avatar May 14 '24 14:05 advancedxy

I'd like to see the tests use some randomly generated inputs.

As a quick hack, I added the following test to CometCastSuite and it shows some differences in results between Spark and Comet.

  test("xxhash64") {
    val input = generateStrings(timestampPattern, 8).toDF("a")
    withTempPath { dir =>
      val data = roundtripParquet(input, dir).coalesce(1)
      data.createOrReplaceTempView("t")
      val df = spark.sql(s"select a, xxhash64(a) from t order by a")
      checkSparkAnswerAndOperator(df)
    }
  }

Some differences:

!== Correct Answer - 1000 ==           == Spark Answer - 1000 ==
 struct<a:string,xxhash64(a):bigint>   struct<a:string,xxhash64(a):bigint>
![,-7444071767201028348]               [,-1205034819632174695]
![	 23,-1992628079781282865]           [	 23,4312780814362028915]
![	31T3,5857608402363468958]           [	31T3,6089516869931970265]

We could extract the generate* methods from CometCastSuite into a separate DataGenerator class that other test suites can leverage.

andygrove avatar May 14 '24 14:05 andygrove

Our hash implementation is also not compatible with Spark. I will file an issue for that.

andygrove avatar May 14 '24 14:05 andygrove

I'd like to see the tests use some randomly generated inputs.

As a quick hack, I added the following test to CometCastSuite and it shows some differences in results between Spark and Comet.

  test("xxhash64") {
    val input = generateStrings(timestampPattern, 8).toDF("a")
    withTempPath { dir =>
      val data = roundtripParquet(input, dir).coalesce(1)
      data.createOrReplaceTempView("t")
      val df = spark.sql(s"select a, xxhash64(a) from t order by a")
      checkSparkAnswerAndOperator(df)
    }
  }

Some differences:

!== Correct Answer - 1000 ==           == Spark Answer - 1000 ==
 struct<a:string,xxhash64(a):bigint>   struct<a:string,xxhash64(a):bigint>
![,-7444071767201028348]               [,-1205034819632174695]
![	 23,-1992628079781282865]           [	 23,4312780814362028915]
![	31T3,5857608402363468958]           [	31T3,6089516869931970265]

We could extract the generate* methods from CometCastSuite into a separate DataGenerator class that other test suites can leverage.

Good catch, and a good way to make sure the impl is correct. Let me check why the test is failing first.

advancedxy avatar May 14 '24 14:05 advancedxy

Let me check why the test is failing first.

Found the issue. The create_hashes_dictionary doesn't handle input hashes correctly, it affects both murmur3hash and this new xxhash64 method.

Let me try to fix that first.

advancedxy avatar May 14 '24 15:05 advancedxy

See https://github.com/apache/datafusion-comet/pull/426 for proposed DataGenerator class

andygrove avatar May 14 '24 15:05 andygrove

Our hash implementation is also not compatible with Spark. I will file an issue for that.

I filed https://github.com/apache/datafusion-comet/issues/427

andygrove avatar May 14 '24 15:05 andygrove

Our hash implementation is also not compatible with Spark. I will file an issue for that.

I filed #427

Thanks for filing this. I think it's the same issue for both murmur3 hash and xxhash64. I will submit a pr to fix that first.

advancedxy avatar May 14 '24 15:05 advancedxy

Found the issue. The create_hashes_dictionary doesn't handle input hashes correctly, it affects both murmur3hash and this new xxhash64 method.

Let me try to fix that first.

I have submitted the fix in this PR and waiting for CI passes. I will create a separate PR to include the murmur3 hash fix and depends on your #426 in the morning (in Beijing time) first.

advancedxy avatar May 14 '24 16:05 advancedxy

@andygrove @viirya I have created #433 and mark this as a draft. We should merge that first and then come back to this PR . PLAL when you have tome.

advancedxy avatar May 15 '24 12:05 advancedxy

@andygrove @viirya @parthchandra and @sunchao would you mind to take a look at this? I think it's ready for review.

advancedxy avatar May 28 '24 01:05 advancedxy

Gently ping @andygrove @viirya, do you have any more comments?

advancedxy avatar Jun 03 '24 02:06 advancedxy

Thanks all for reviewing, @andygrove @viirya @kazuyukitanimura @parthchandra

advancedxy avatar Jun 04 '24 02:06 advancedxy