Support various image hash functions for deduplication.
Is your feature request related to a problem?
N/A
Describe the solution you'd like
I would like native support for various image hashing functions.
- Average hashing (aHash)
- Perceptual hashing (pHash)
- Difference hashing (dHash)
- Wavelet hashing (wHash)
- Crop-resistant hashing (crop_resistant_hash)
Describe alternatives you've considered
I used a difference hash UDF:
import daft
import imagehash
from daft import Series
from PIL import Image

@daft.udf(return_dtype=bytes)
def image_dhash(images: Series):
    """The dhash algorithm is faster than ahash and phash.
    https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
    """
    # Hash each decoded image and return the raw bytes of its boolean hash array.
    return [imagehash.dhash(Image.fromarray(img)).hash.tobytes() for img in images]
Additional Context
This would make it easy to deduplicate images by using the hash in a groupby + anyvalue.
Would you like to implement a fix?
No
Hi, I'd love to take a stab at this if possible :)
@fool1280 thanks, assigned!
Yo @fool1280, you still working on this issue? If ok I would like to contribute.
@codekshitij assigning you, thanks for taking a look 🙏
Hi @rchowell! I've started working on this issue and wanted to share my approach.
My Implementation Plan:
- Starting with average_hash as the foundation, then extending to the other 4 hash types
- Following existing patterns - I've analyzed the codebase structure and will implement using the same patterns as other image functions
- Target files to modify:
  - src/daft-image/src/functions/hash.rs - Core hash implementations
  - src/daft-image/src/series.rs - Series-level operations
  - daft/expressions/expressions.py - Python API bindings
  - Tests in existing test_image.py file
Technical Approach:
- Using the standard ScalarUDF pattern like other image functions
- Algorithm: grayscale → 8x8 resize → average threshold → binary hash
- Output: 64-character binary strings for deduplication use cases
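For readers following along, the average-hash steps listed above (grayscale, 8x8 resize, mean threshold, 64-character binary string) can be sketched in plain Python. This is only an illustration under assumptions: the function name `average_hash`, the nested-list grayscale input, and the block-averaging downsample are hypothetical stand-ins, not the Rust implementation planned for src/daft-image. The imagehash library resizes with Pillow rather than by block averaging, so its output can differ from this sketch.

```python
def average_hash(gray, hash_size=8):
    """Return a string of '0'/'1' characters of length hash_size * hash_size.

    `gray` is assumed to be an already-grayscale image given as a list of
    rows, each row a list of pixel intensities (for example 0-255).
    """
    height, width = len(gray), len(gray[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image must be at least hash_size x hash_size pixels")

    # Step 1: shrink to hash_size x hash_size by averaging each block of pixels.
    # (Assumption: a simple box filter stands in for a proper image resize.)
    cells = []
    for r in range(hash_size):
        top = r * height // hash_size
        bottom = (r + 1) * height // hash_size
        for c in range(hash_size):
            left = c * width // hash_size
            right = (c + 1) * width // hash_size
            block = [gray[y][x] for y in range(top, bottom) for x in range(left, right)]
            cells.append(sum(block) / len(block))

    # Step 2: compare every cell with the mean of all cells.
    mean = sum(cells) / len(cells)

    # Step 3: emit one bit per cell, row by row.
    return "".join("1" if value > mean else "0" for value in cells)


# Usage: a 16x16 image whose left half is black and right half is white.
image = [[0] * 8 + [255] * 8 for _ in range(16)]
print(average_hash(image))
```

Because every bit depends only on whether a cell is brighter than the overall mean, small changes in scale or compression tend to leave most bits unchanged, which is what makes the hash usable as a deduplication key.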
Questions:
- Any preference for hash output format? (binary string vs hex string)
- Should I include Hamming distance utilities for comparing hashes?
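To make the second question concrete, a Hamming distance utility over binary-string hashes could look roughly like the sketch below. The name `hamming_distance` and its signature are hypothetical and not part of any existing Daft API; it only assumes both hashes are equal-length strings of '0' and '1'.

```python
def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length binary strings differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(hash_a, hash_b))


# Usage: two 8-bit hashes that differ in exactly two positions.
print(hamming_distance("00001111", "00011101"))
```

A small distance suggests near-duplicate images, whereas grouping on exact hash equality only catches images whose hashes match bit for bit.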
Thanks for the clear issue description! The Python imagehash reference implementation is very helpful.
@codekshitij hash output should be a binary string, no need to add additional functionality, but if you want to then go for it. Thanks.
Hi @rchowell!
I've successfully implemented the average hash function and all tests are passing.
Question: Would you prefer that I:
- Create a draft PR now with just the average hash implementation for early feedback, then add the remaining hash types in follow-up commits, or
- Wait and create a single PR with all 5 hash types implemented?
I'm leaning toward option 1 since it would allow for early feedback on my implementation approach, but wanted to check your preference.
The current average hash implementation follows the existing codebase patterns and returns 64-character binary strings as requested.
Thanks!
Hey @rchowell, I just made a PR for this issue.