Implement xxhash algorithms as part of the expression API
Is your feature request related to a problem or challenge?
I am currently in need of using datafusion SQL query engine (through delta-rs merge operation) to hash with a specific hashing algorithm.
https://xxhash.com/ https://github.com/Cyan4973/xxHash Rust implementation : https://crates.io/crates/twox-hash
Describe the solution you'd like
I'd like the hashing functions xxhash32 and xxhash64 to be part of datafusion expression API so the functions can be used in SQL context for hashing data.
You could also implement this as a user defined function perhaps
Hi, thanks for the reply.
That is true, although, my concern comes from this other issue raised in delta-rs project (link below), since I need to use this hash operation in a Delta merge operation.
Unless I change the codebase for the delta-rs project, I'd be unable to create an UDF (not sure if there's a way to create and register it in the delta-rs context in python, sounds complicated)
https://github.com/delta-io/delta-rs/issues/3105#issuecomment-2577654666
Can I take this up? I shall add the new functions in a new file in crypto/xxhas.rs similar to the crypto/sha224.rs file and import them in the crypto/mod.rs file. And then add some tests.
Do tell me if I'm missing something!
take
@Spaarsh sounds like a good plan
Perhaps consider adding a non cryptographic module and add wyhash as well (here some of the crates I've used for similar usecase but in polars: https://github.com/ion-elgreco/polars-hash/blob/main/polars_hash%2FCargo.toml)
Perhaps consider adding a non cryptographic module and add wyhash as well (here some of the crates I've used for similar usecase but in polars: https://github.com/ion-elgreco/polars-hash/blob/main/polars_hash%2FCargo.toml)
Sure. I'll take a look at it and get back to you in case of any doubts! Thanks!
@HectorPascual is this the output that you expect?
> SELECT
xxhash32(column1) AS xxhash32_result
FROM (
SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1
) AS subquery;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2 |
| 0c436334 |
| 9cec73c4 |
| 6dd9af36 |
| 1c5751f3 |
+-----------------+
xxhash32 output screenshot
> SELECT
xxhash64(column1) AS xxhash64_result
FROM (
SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1
) AS subquery;
+------------------+
| xxhash64_result |
+------------------+
| b7b41276360564d4 |
| 6021b5621680598b |
| 26167c2af5162ca4 |
| 913914322ca46b89 |
| 6a81b47405b648ed |
+------------------+
xxhash64 output screenshot
@HectorPascual is this the output that you expect?
> SELECT xxhash32(column1) AS xxhash32_result FROM ( SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1 ) AS subquery; +-----------------+ | xxhash32_result | +-----------------+ | b6ecc8b2 | | 0c436334 | | 9cec73c4 | | 6dd9af36 | | 1c5751f3 | +-----------------+xxhash32 output screenshot
> SELECT xxhash64(column1) AS xxhash64_result FROM ( SELECT UNNEST(ARRAY[1, 2, 3, 4, 5]) AS column1 ) AS subquery; +------------------+ | xxhash64_result | +------------------+ | b7b41276360564d4 | | 6021b5621680598b | | 26167c2af5162ca4 | | 913914322ca46b89 | | 6a81b47405b648ed | +------------------+xxhash64 output screenshot
Thanks!
Yes, that looks good!
@alamb @HectorPascual I would like to have your input on this:
-
I was wondering how to add support for a
Noneor empty values. Should I replace the input with "NULL" string whenever aNonevalue is entered? -
There is also the concern of not taking the seed from the user. Currently the seed is given a default value and I think having the user's control over it be helpful.
-
If we decide to ask for a seed by the user, do we keep the default to be 0? Wouldn't obfuscating that default values cause issues for the user? I have added support for individual values of the types
Utf8,Int32,Int64,UInt32andUInt64.
These are the query results:
> SELECT xxhash64('1') AS xxhash64_result;
+------------------+
| xxhash64_result |
+------------------+
| b7b41276360564d4 |
+------------------+
1 row(s) fetched.
Elapsed 0.004 seconds.
> SELECT xxhash64(1) AS xxhash64_result;
+------------------+
| xxhash64_result |
+------------------+
| b7b41276360564d4 |
+------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> SELECT xxhash32('1') AS xxhash32_result;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2 |
+-----------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> SELECT xxhash32(1) AS xxhash32_result;
+-----------------+
| xxhash32_result |
+-----------------+
| b6ecc8b2 |
+-----------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
If this is alright, then I will add the tests for these functions.
Perhaps consider adding a non cryptographic module and add wyhash as well (here some of the crates I've used for similar usecase but in polars: https://github.com/ion-elgreco/polars-hash/blob/main/polars_hash%2FCargo.toml)
@alamb should I open a new issue for adding wyhash and the other functions? Or they can be added along with this PR?
Looks good, I can try tomorrow hashing the same inputs with the Python xxhash module and double check the output of the results. @Spaarsh
Let me gather some feedback from colleagues on your open concerns and come back later with details.
Hi @Spaarsh,
The hashes match to the python module :
In regards to your inputs :
- The python module breaks when entering a None,
You can for instance replace None with '' (empty string)
2-3. On the seed side, for the python module the default seed is 0, see : https://github.com/ifduyue/python-xxhash#:~:text=An%20optional%20seed%20(default%20is%200)%20can%20be%20used%20to%20alter%20the%20result%20predictably%3A Is there an option to have this as an optional argument?
Thanks!
Thanks @HectorPascual! I'm opening PR from here on for transperancy's sake!
@HectorPascual I have added an optional seed argument and have also added support for Binary input.
Hi @Spaarsh, I'm interested in this issue. May I give it a try?
Hey, is anyone working on this?
I really need xxhash (and others) in DataFusion (I would especially love to use it from LanceDB which builds on top of DataFusion).
I nobody is actively working on this issue I might give it a try.
Sure @danielgafni! Please go ahead. I request the maintainers to unassign this issue from me.