feat: Add xxhash64 function support
Which issue does this PR close?
Part of #205 Closes #344
Rationale for this change
More function coverage
What changes are included in this PR?
- include twox-hash as a dep in rust
- implement xxhash64 related method in rust side
- glue code to bridge the jvm and rust
How are these changes tested?
New added test.
Thanks @advancedxy. I plan on reviewing this PR today.
Could you also update docs/source/user-guide/expressions.md to add xxhash64 as a supported expression?
Could you also update docs/source/user-guide/expressions.md to add xxhash64 as a supported expression?
Of course, I will update that among other things: such as the review comments and the inspection file: spark/inspections/CometTPCDSQueriesList-results.txt in a followup commit.
I'd like to see the tests use some randomly generated inputs.
As a quick hack, I added the following test to CometCastSuite and it shows some differences in results between Spark and Comet.
test("xxhash64") {
val input = generateStrings(timestampPattern, 8).toDF("a")
withTempPath { dir =>
val data = roundtripParquet(input, dir).coalesce(1)
data.createOrReplaceTempView("t")
val df = spark.sql(s"select a, xxhash64(a) from t order by a")
checkSparkAnswerAndOperator(df)
}
}
Some differences:
!== Correct Answer - 1000 == == Spark Answer - 1000 ==
struct<a:string,xxhash64(a):bigint> struct<a:string,xxhash64(a):bigint>
![,-7444071767201028348] [,-1205034819632174695]
![ 23,-1992628079781282865] [ 23,4312780814362028915]
![ 31T3,5857608402363468958] [ 31T3,6089516869931970265]
We could extract the generate* methods from CometCastSuite into a separate DataGenerator class that other test suites can leverage.
Our hash implementation is also not compatible with Spark. I will file an issue for that.
I'd like to see the tests use some randomly generated inputs.
As a quick hack, I added the following test to
CometCastSuiteand it shows some differences in results between Spark and Comet.test("xxhash64") { val input = generateStrings(timestampPattern, 8).toDF("a") withTempPath { dir => val data = roundtripParquet(input, dir).coalesce(1) data.createOrReplaceTempView("t") val df = spark.sql(s"select a, xxhash64(a) from t order by a") checkSparkAnswerAndOperator(df) } }Some differences:
!== Correct Answer - 1000 == == Spark Answer - 1000 == struct<a:string,xxhash64(a):bigint> struct<a:string,xxhash64(a):bigint> ![,-7444071767201028348] [,-1205034819632174695] ![ 23,-1992628079781282865] [ 23,4312780814362028915] ![ 31T3,5857608402363468958] [ 31T3,6089516869931970265]We could extract the
generate*methods fromCometCastSuiteinto a separateDataGeneratorclass that other test suites can leverage.
Good catch, and a good way to make sure the impl is correct. Let me check why the test is failing first.
Let me check why the test is failing first.
Found the issue. The create_hashes_dictionary doesn't handle input hashes correctly, it affects both murmur3hash and this new xxhash64 method.
Let me try to fix that first.
See https://github.com/apache/datafusion-comet/pull/426 for proposed DataGenerator class
Our
hashimplementation is also not compatible with Spark. I will file an issue for that.
I filed https://github.com/apache/datafusion-comet/issues/427
Our
hashimplementation is also not compatible with Spark. I will file an issue for that.I filed #427
Thanks for filing this. I think it's the same issue for both murmur3 hash and xxhash64. I will submit a pr to fix that first.
Found the issue. The create_hashes_dictionary doesn't handle input hashes correctly, it affects both murmur3hash and this new xxhash64 method.
Let me try to fix that first.
I have submitted the fix in this PR and waiting for CI passes. I will create a separate PR to include the murmur3 hash fix and depends on your #426 in the morning (in Beijing time) first.
@andygrove @viirya I have created #433 and mark this as a draft. We should merge that first and then come back to this PR . PLAL when you have tome.
@andygrove @viirya @parthchandra and @sunchao would you mind to take a look at this? I think it's ready for review.
Gently ping @andygrove @viirya, do you have any more comments?
Thanks all for reviewing, @andygrove @viirya @kazuyukitanimura @parthchandra