datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Implement xxhash algorithms as part of the expression API

Open HectorPascual opened this issue 11 months ago • 17 comments

Is your feature request related to a problem or challenge?

I am currently in need of using datafusion SQL query engine (through delta-rs merge operation) to hash with a specific hashing algorithm.

https://xxhash.com/ https://github.com/Cyan4973/xxHash Rust implementation : https://crates.io/crates/twox-hash

Describe the solution you'd like

I'd like the hashing functions xxhash32 and xxhash64 to be part of datafusion expression API so the functions can be used in SQL context for hashing data.

HectorPascual avatar Jan 08 '25 13:01 HectorPascual

You could also implement this as a user defined function perhaps

alamb avatar Jan 08 '25 16:01 alamb

Hi, thanks for the reply.

That is true, although, my concern comes from this other issue raised in delta-rs project (link below), since I need to use this hash operation in a Delta merge operation.

Unless I change the codebase for the delta-rs project, I'd be unable to create an UDF (not sure if there's a way to create and register it in the delta-rs context in python, sounds complicated)

https://github.com/delta-io/delta-rs/issues/3105#issuecomment-2577654666

HectorPascual avatar Jan 08 '25 17:01 HectorPascual

Can I take this up? I shall add the new functions in a new file in crypto/xxhas.rs similar to the crypto/sha224.rs file and import them in the crypto/mod.rs file. And then add some tests.

Do tell me if I'm missing something!

Spaarsh avatar Jan 19 '25 12:01 Spaarsh

take

Spaarsh avatar Jan 19 '25 17:01 Spaarsh

@Spaarsh sounds like a good plan

alamb avatar Jan 22 '25 23:01 alamb

Perhaps consider adding a non cryptographic module and add wyhash as well (here some of the crates I've used for similar usecase but in polars: https://github.com/ion-elgreco/polars-hash/blob/main/polars_hash%2FCargo.toml)

ion-elgreco avatar Jan 23 '25 11:01 ion-elgreco

Perhaps consider adding a non cryptographic module and add wyhash as well (here some of the crates I've used for similar usecase but in polars: https://github.com/ion-elgreco/polars-hash/blob/main/polars_hash%2FCargo.toml)

Sure. I'll take a look at it and get back to you in case of any doubts! Thanks!

Spaarsh avatar Jan 23 '25 13:01 Spaarsh

@HectorPascual is this the output that you expect?

> SELECT 
    xxhash32(column1) AS xxhash32_result
FROM (
    SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1
) AS subquery;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2        |
| 0c436334        |
| 9cec73c4        |
| 6dd9af36        |
| 1c5751f3        |
+-----------------+
xxhash32 output screenshot

Image

> SELECT 
    xxhash64(column1) AS xxhash64_result
FROM (
    SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1
) AS subquery;
+------------------+
| xxhash64_result  |
+------------------+
| b7b41276360564d4 |
| 6021b5621680598b |
| 26167c2af5162ca4 |
| 913914322ca46b89 |
| 6a81b47405b648ed |
+------------------+
xxhash64 output screenshot

Image

Thanks!

Spaarsh avatar Jan 26 '25 20:01 Spaarsh

@HectorPascual is this the output that you expect?

> SELECT 
    xxhash32(column1) AS xxhash32_result
FROM (
    SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1
) AS subquery;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2        |
| 0c436334        |
| 9cec73c4        |
| 6dd9af36        |
| 1c5751f3        |
+-----------------+

xxhash32 output screenshot Image

> SELECT 
    xxhash64(column1) AS xxhash64_result
FROM (
    SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1
) AS subquery;
+------------------+
| xxhash64_result  |
+------------------+
| b7b41276360564d4 |
| 6021b5621680598b |
| 26167c2af5162ca4 |
| 913914322ca46b89 |
| 6a81b47405b648ed |
+------------------+

xxhash64 output screenshot Image

Thanks!

Yes, that looks good!

HectorPascual avatar Jan 26 '25 21:01 HectorPascual

@alamb @HectorPascual I would like to have your input on this:

  1. I was wondering how to add support for a None or empty values. Should I replace the input with "NULL" string whenever a None value is entered?

  2. There is also the concern of not taking the seed from the user. Currently the seed is given a default value and I think having the user's control over it be helpful.

  3. If we decide to ask for a seed by the user, do we keep the default to be 0? Wouldn't obfuscating that default values cause issues for the user? I have added support for individual values of the types Utf8, Int32, Int64, UInt32 and UInt64.

These are the query results:

> SELECT xxhash64('1') AS xxhash64_result;
+------------------+
| xxhash64_result  |
+------------------+
| b7b41276360564d4 |
+------------------+
1 row(s) fetched. 
Elapsed 0.004 seconds.
> SELECT xxhash64(1) AS xxhash64_result;
+------------------+
| xxhash64_result  |
+------------------+
| b7b41276360564d4 |
+------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.
> SELECT xxhash32('1') AS xxhash32_result;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2        |
+-----------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.
> SELECT xxhash32(1) AS xxhash32_result;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2        |
+-----------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

If this is alright, then I will add the tests for these functions.

Perhaps consider adding a non cryptographic module and add wyhash as well (here some of the crates I've used for similar usecase but in polars: https://github.com/ion-elgreco/polars-hash/blob/main/polars_hash%2FCargo.toml)

@alamb should I open a new issue for adding wyhash and the other functions? Or they can be added along with this PR?

Spaarsh avatar Jan 27 '25 17:01 Spaarsh

Looks good, I can try tomorrow hashing the same inputs with the Python xxhash module and double check the output of the results. @Spaarsh

Let me gather some feedback from colleagues on your open concerns and come back later with details.

HectorPascual avatar Jan 27 '25 21:01 HectorPascual

Hi @Spaarsh,

The hashes match to the python module :

Image

In regards to your inputs :

  1. The python module breaks when entering a None, Image You can for instance replace None with '' (empty string)

2-3. On the seed side, for the python module the default seed is 0, see : https://github.com/ifduyue/python-xxhash#:~:text=An%20optional%20seed%20(default%20is%200)%20can%20be%20used%20to%20alter%20the%20result%20predictably%3A Is there an option to have this as an optional argument?

Thanks!

HectorPascual avatar Jan 29 '25 12:01 HectorPascual

Thanks @HectorPascual! I'm opening PR from here on for transperancy's sake!

Spaarsh avatar Jan 30 '25 03:01 Spaarsh

@HectorPascual I have added an optional seed argument and have also added support for Binary input.

Spaarsh avatar Feb 04 '25 20:02 Spaarsh

Hi @Spaarsh, I'm interested in this issue. May I give it a try?

Standing-Man avatar Aug 01 '25 09:08 Standing-Man

Hey, is anyone working on this?

I really need xxhash (and others) in DataFusion (I would especially love to use it from LanceDB which builds on top of DataFusion).

I nobody is actively working on this issue I might give it a try.

danielgafni avatar Dec 10 '25 16:12 danielgafni

Sure @danielgafni! Please go ahead. I request the maintainers to unassign this issue from me.

Spaarsh avatar Dec 11 '25 17:12 Spaarsh