Use colorhash to find similarity in percentage

Open EBakirdinov opened this issue 4 months ago • 5 comments

My question is: can I use colorhash to get the similarity between two images as a percentage?

Example:

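from PIL import Image
import imagehash

# path1 and path2 are the paths of the two images to compare
test = imagehash.colorhash(Image.open(path1), binbits=64)
test_2 = imagehash.colorhash(Image.open(path2), binbits=64)

print(test - test_2)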

Let's imagine I get 75. But what is the maximum possible value for these two images? Is it 80, so my images are not similar, or is it 800, so my images are quite similar?

EBakirdinov, Feb 28 '24 08:02

The code is here: https://github.com/JohannesBuchner/imagehash/blob/master/imagehash/__init__.py#L435

It computes 14 numbers: one for black, one for gray, and 6 histogram bins each for faint and bright colors. Each number is between 0 and 2^binbits - 1. The bits of these numbers are then flattened into a single, large array of binary digits.
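As a rough illustration of the flattening step, here is a minimal sketch. It assumes each value is written as a plain binary number, most significant bit first; the library's exact bit extraction may differ, and `to_bits` and `flatten` are hypothetical helpers, not imagehash functions.

```python
def to_bits(value, binbits):
    """Binary digits of value, most significant bit first, padded to binbits."""
    return [(value >> (binbits - 1 - i)) & 1 for i in range(binbits)]


def flatten(values, binbits):
    """Concatenate the bits of every value into one flat list."""
    bits = []
    for v in values:
        bits.extend(to_bits(v, binbits))
    return bits


binbits = 3
# 14 made-up values: black, gray, 6 faint bins, 6 bright bins
values = [7, 0] + [0] * 12
bits = flatten(values, binbits)
print(len(bits))  # 14 values * 3 bits each
```

Under this assumption the flat array always has 14 * binbits entries, whatever the image looks like.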

The subtraction operation is here: https://github.com/JohannesBuchner/imagehash/blob/master/imagehash/__init__.py#L111
It counts the number of bits that differ.

So I guess the maximum possible is binbits*14?
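If that guess holds, a percentage could be derived along these lines. This is only a sketch: `max_distance` and `similarity_percent` are hypothetical helpers, and the result is just the fraction of matching bits, not a perceptual similarity.

```python
N_VALUES = 14  # black, gray, 6 faint bins, 6 bright bins


def max_distance(binbits, n_values=N_VALUES):
    """Largest possible bit difference, assuming the hash has binbits * 14 bits."""
    return binbits * n_values


def similarity_percent(distance, binbits, n_values=N_VALUES):
    """Share of bits that match, as a percentage."""
    total = max_distance(binbits, n_values)
    if not 0 <= distance <= total:
        raise ValueError("distance must be between 0 and %d" % total)
    return 100.0 * (1 - distance / total)


# The example from the question: a difference of 75 with binbits=64
print(max_distance(64))
print(round(similarity_percent(75, 64), 2))
```

With real hashes, the distance would come from `test - test_2` as in the question.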

JohannesBuchner, Feb 28 '24 10:02

Similar images should have a small difference.

This function is designed with small binbits (default=3) in mind. If two numbers are very different, all 3 bits are likely to differ, while if they are similar, likely only one or two (the least significant bits) differ. This does not always hold (in decimal digits, 9 vs 10 differ in 2 digits, although the numbers are actually close together), so it is not ideal. But if you choose binbits=64, counting the number of differing bits is not a good approach, and does not really group similar images together.
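The same effect can be shown in binary with a small sketch. It assumes each value is stored as a plain binary number, which simplifies what the library does, and `bit_difference` is a hypothetical helper, not part of imagehash.

```python
def bit_difference(a, b, binbits):
    """Count the bit positions in which a and b differ (Hamming distance)."""
    mask = (1 << binbits) - 1
    return bin((a ^ b) & mask).count("1")


binbits = 4
# 7 and 8 are neighbours, yet 0111 vs 1000 differ in every bit
near = bit_difference(7, 8, binbits)
# 0 and 8 are far apart, yet 0000 vs 1000 differ in a single bit
far = bit_difference(0, 8, binbits)
print(near, far)
```

So a small numeric change can flip many bits and a large one can flip few, and the more bits per value, the more room there is for this mismatch.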

All that said, the colorhash is just one possible implementation, and there are probably better approaches.

JohannesBuchner, Feb 28 '24 10:02

@JohannesBuchner I'm a little bit confused. I'm new to this. For example, I have one blank black image and one blank white image, with binbits=32. In the end I get 128. Shouldn't it be 448 (32 * 14)?

EBakirdinov, Feb 28 '24 11:02

> Similar images should have a small difference.
>
> This function is designed with small binbits (default=3) in mind. If two numbers are very different, all 3 bits are likely to differ, while if they are similar, likely only one or two (the least significant bits) differ. This does not always hold (in decimal digits, 9 vs 10 differ in 2 digits, although the numbers are actually close together), so it is not ideal. But if you choose binbits=64, counting the number of differing bits is not a good approach, and does not really group similar images together.
>
> All that said, the colorhash is just one possible implementation, and there are probably better approaches.

Ooh. So working with high binbits is not that efficient?

EBakirdinov, Feb 28 '24 11:02

> @JohannesBuchner I'm a little bit confused. I'm new to this. For example, I have one blank black image and one blank white image, with binbits=32. In the end I get 128. Shouldn't it be 448 (32 * 14)?

Maybe copy the function code of colorhash and run it line by line for an example image, and look at the variables. frac_black and frac_gray are probably as you expect, but I am not sure about h_bright_counts and h_faint_counts.
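For intuition before stepping through the real function, here is a simplified per-pixel sketch of the binning idea. Everything in it is an assumption for illustration: `classify` is a hypothetical helper, the two thresholds are placeholders, and the library works on whole arrays with its own cutoffs.

```python
import colorsys

# Placeholder thresholds on a 0..1 scale, chosen only for illustration
BLACK_BELOW = 1 / 8
GRAY_SAT_BELOW = 1 / 3


def classify(r, g, b):
    """Return 'black', 'gray' or 'color' for an RGB pixel with 0..255 channels."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if v < BLACK_BELOW:
        return "black"  # very dark pixel
    if s < GRAY_SAT_BELOW:
        return "gray"   # not dark, but hardly any color
    return "color"      # would go on to a faint or bright histogram bin


print(classify(0, 0, 0))        # pixel of a blank black image
print(classify(255, 255, 255))  # pixel of a blank white image
print(classify(255, 0, 0))      # a saturated red pixel
```

Under these assumptions, a blank black image lands entirely in the black value and a blank white image entirely in the gray value, so the 12 histogram values would be zero for both and could not contribute to the difference. That would explain a result well below binbits*14, though the exact number is best checked by inspecting the variables as suggested.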

JohannesBuchner, Feb 28 '24 11:02