
Support various image hash functions for deduplication.

Open rchowell opened this issue 4 months ago • 2 comments

Is your feature request related to a problem?

N/A

Describe the solution you'd like

I would like native support for various image hashing functions.

Describe alternatives you've considered

I used a difference hash UDF.

import daft
import imagehash
from daft import Series
from PIL import Image

@daft.udf(return_dtype=bytes)
def image_dhash(images: Series):
    """The dhash algorithm is faster than ahash and phash.

    https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
    """
    return [imagehash.dhash(Image.fromarray(img)).hash.tobytes() for img in images]

Additional Context

A native hash would make it easy to deduplicate images by using the hash in a groupby + any_value.
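
For illustration, here is a plain-Python sketch of what that group-by plus "any value" step amounts to. It assumes each image is represented by a hypothetical (path, hash) pair; the helper name `dedupe_by_hash` is made up for this example and is not part of Daft.

```python
def dedupe_by_hash(rows):
    """Keep one arbitrary representative per hash value.

    rows: iterable of (path, hash_value) pairs, where hash_value is hashable
    (for example bytes or str). This mirrors grouping by the hash and taking
    any one path from each group.
    """
    representative = {}
    for path, hash_value in rows:
        # setdefault keeps the first path seen for each hash and ignores the rest.
        representative.setdefault(hash_value, path)
    return representative


# Hypothetical input: two files share a hash, so only one of them survives.
rows = [("a.png", b"\x01"), ("a_copy.png", b"\x01"), ("b.png", b"\x02")]
unique = dedupe_by_hash(rows)
```

Which member of a group survives is arbitrary in a real "any value" aggregation; this sketch happens to keep the first one it sees.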

Would you like to implement a fix?

No

rchowell • Aug 01 '25 00:08

Hi, I'd love to take a stab at this if possible :)

fool1280 • Aug 24 '25 09:08

@fool1280 thanks, assigned!

rchowell • Aug 26 '25 20:08

Yo @fool1280, you still working on this issue? If ok I would like to contribute.

codekshitij • Sep 09 '25 23:09

@codekshitij assigning you, thanks for taking a look 🙏

rchowell • Sep 10 '25 00:09

Hi @rchowell! I've started working on this issue and wanted to share my approach.

My Implementation Plan:

  1. Starting with average_hash as the foundation, then extending to the other 4 hash types

  2. Following existing patterns - I've analyzed the codebase structure and will implement using the same patterns as other image functions

  3. Target files to modify:

    • src/daft-image/src/functions/hash.rs - Core hash implementations
    • src/daft-image/src/series.rs - Series-level operations
    • daft/expressions/expressions.py - Python API bindings
    • Tests in existing test_image.py file

Technical Approach:

  • Using the standard ScalarUDF pattern like other image functions

  • Algorithm: grayscale → 8x8 resize → average threshold → binary hash

  • Output: 64-character binary strings for deduplication use cases
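
The pipeline above can be sketched in dependency-free Python. This is only an illustration under stated assumptions, not the Rust implementation planned for src/daft-image: the function names are hypothetical, the input is a nested list of RGB tuples, and a simple box average stands in for whatever resize filter the image library would actually apply.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (R, G, B) tuples to luma using the ITU-R 601 weights."""
    return [
        [(299 * r + 587 * g + 114 * b) / 1000 for (r, g, b) in row]
        for row in rgb_rows
    ]


def average_hash(gray_rows, hash_size=8):
    """Return a hash_size*hash_size character string of '0'/'1'.

    Assumes the image is at least hash_size pixels in each dimension.
    Downsampling uses a box average per cell (an assumption of this sketch).
    """
    height, width = len(gray_rows), len(gray_rows[0])
    cells = []
    for r in range(hash_size):
        r0, r1 = r * height // hash_size, (r + 1) * height // hash_size
        for c in range(hash_size):
            c0, c1 = c * width // hash_size, (c + 1) * width // hash_size
            block = [gray_rows[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # A bit is 1 when its cell is brighter than the mean of all cells.
    return "".join("1" if value > mean else "0" for value in cells)


# Hypothetical 16x16 test image: left half black, right half white.
image = [[(0, 0, 0)] * 8 + [(255, 255, 255)] * 8 for _ in range(16)]
bits = average_hash(to_grayscale(image))
```

With the default `hash_size` of 8, the result is the 64-character binary string described above.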

Questions:

  • Any preference for hash output format? (binary string vs hex string)

  • Should I include Hamming distance utilities for comparing hashes?
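
To make the two questions concrete, here is a small Python sketch of both options. The helper names `bits_to_hex` and `hamming_distance` are hypothetical and only illustrate the idea; they do not describe an existing Daft API.

```python
def bits_to_hex(bits):
    """Convert a binary string such as '00001111' to zero-padded hex.

    Assumes len(bits) is a multiple of 4, so each hex digit covers 4 bits.
    """
    return format(int(bits, 2), "0{}x".format(len(bits) // 4))


def hamming_distance(a, b):
    """Count the positions at which two equal-length hash strings differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


hash_a = "00001111" * 8
hash_b = "00001110" * 8  # differs from hash_a in one bit per 8-bit group

hex_a = bits_to_hex(hash_a)
distance = hamming_distance(hash_a, hash_b)
```

Exact deduplication only needs equality on the hash, so a distance utility matters mainly for near-duplicate matching, where a small distance suggests similar images.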

Thanks for the clear issue description! The Python imagehash reference implementation is very helpful.

codekshitij • Sep 10 '25 05:09

@codekshitij the hash output should be a binary string. No need to add additional functionality, but if you want to, go for it. Thanks.

rchowell • Sep 10 '25 16:09

Hi @rchowell!

I've successfully implemented the average hash function and all tests are passing.

Question: Would you prefer that I:

  1. Create a draft PR now with just the average hash implementation for early feedback, then add the remaining hash types in follow-up commits, or
  2. Wait and create a single PR with all 5 hash types implemented?

I'm leaning toward option 1 since it would allow for early feedback on my implementation approach, but wanted to check your preference.

The current average hash implementation follows the existing codebase patterns and returns 64-character binary strings as requested.

Thanks!

codekshitij • Sep 12 '25 06:09

Hey @rchowell, I just opened a PR for this issue.

codekshitij • Sep 18 '25 00:09