Enhance Image Similarity Search

Open lfcnassif opened this issue 5 years ago • 6 comments

It is possible to run images through a deep learning classification model pre-trained on a large, generic data set (such as ImageNet) and take the activations of the last layer before the softmax output layer as rich feature vectors. These vectors could then be stored in the index and used by the nearest-neighbour lookup instead of the currently generated features, provided the user has the required dependencies installed.
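
As a rough illustration of the lookup step only, here is a minimal sketch of a brute-force nearest-neighbour search over stored feature vectors. The item ids, the tiny 3-dimensional vectors, and the function name are hypothetical; real embeddings would have hundreds or thousands of dimensions and would come from the pre-trained model, which is not shown here.

```python
import math


def nearest_neighbours(query, index, k=3):
    """Return the k (distance, item_id) pairs closest to `query`.

    `index` is a hypothetical dict mapping an item id to its feature vector.
    """
    scored = []
    for item_id, vector in index.items():
        # Euclidean distance between the query vector and the stored vector.
        scored.append((math.dist(query, vector), item_id))
    # Smallest distance first.
    scored.sort()
    return scored[:k]


# Toy index with made-up vectors, for illustration only.
index = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.8, 0.2, 0.1],
    "img_c": [0.0, 0.1, 0.9],
}

for distance, item_id in nearest_neighbours([1.0, 0.0, 0.0], index, k=2):
    print(item_id, round(distance, 3))
```

A linear scan like this is only meant to show the idea; it says nothing about how IPED stores or searches its current features.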

Maybe batch image processing (as described in #357) could make this faster and feasible to run on the CPU.

There are a number of pre-trained models and weights available in both Python and Java (https://deeplearning4j.konduit.ai/model-zoo/zoo-models).

This idea was proposed by @joaomacedo and is also covered in Andrew Ng's Coursera courses.

lfcnassif avatar Feb 01 '21 21:02 lfcnassif

Hi everyone. I'd like to record one use case here that I consider important, so it can be used to validate a future implementation of this issue.

Different IPED parsers find image thumbnails. In a recent case, the Telegram parser found a very low-resolution thumbnail of an image the cellphone user had sent via Telegram. Unfortunately, the original image could no longer be found, and the low-resolution thumbnail was of no use on its own. So I decided to try IPED's similar image search, which returned 4134 results. Being persistent, I checked the results visually, one by one, and found a match. It was not the original but another thumbnail, made by a different app (File manager plus), this time at a much higher resolution. However, it was at position 3937 of 4134, i.e., I only found it because of my persistence (and the great help of the tool, of course).

So I suggest that future implementations use real low-resolution thumbnails as validation cases, to ensure that thumbnails of the same image at different resolutions get a higher score in the similar image search.
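
A validation case of this kind could be expressed as a check on the rank of the known match in the result list. The sketch below assumes the search returns an ordered list of item ids; the ids, the function names, and the `top_k` threshold are all hypothetical.

```python
def rank_of(expected_id, ranked_ids):
    """Return the 1-based position of expected_id in ranked_ids, or None."""
    for position, item_id in enumerate(ranked_ids, start=1):
        if item_id == expected_id:
            return position
    return None


def passes(expected_id, ranked_ids, top_k=10):
    """True if the known match appears within the first top_k results."""
    rank = rank_of(expected_id, ranked_ids)
    return rank is not None and rank <= top_k


# Made-up result list: the known higher-resolution thumbnail is third.
results = ["other_1", "other_2", "bigger_thumb", "other_3"]
print(rank_of("bigger_thumb", results))
print(passes("bigger_thumb", results, top_k=2))
```

In the case described above, the known match sat at rank 3937 of 4134, so a check like this would fail for any reasonable `top_k` and would make regressions or improvements easy to measure.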

patrickdalla avatar Apr 26 '25 16:04 patrickdalla

Thanks @patrickdalla, this use case is really interesting and should also be tested.

I have also already considered implementing a PhotoDNA-based similar image search. We already use PhotoDNA to look up files in CSAM databases, but we could also use it to search within the case, like @wladimirleite's current similar image search. Another interesting algorithm is Facebook's PDQ. It is better than PhotoDNA at matching very small thumbnails, although AFAIK it is not used in CSAM databases, and PhotoDNA is a bit more robust to image cropping.
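
For context, bit-string perceptual hashes such as PDQ (a 256-bit hash) are typically compared by Hamming distance, i.e. the number of differing bits. The sketch below shows only that comparison on hex-encoded hashes; the short hash values are made up, and computing the hash itself is not shown. PhotoDNA uses its own format and distance, which this sketch does not cover.

```python
def hamming_distance(hash_a_hex, hash_b_hex):
    """Count the differing bits between two hex-encoded hashes."""
    a = int(hash_a_hex, 16)
    b = int(hash_b_hex, 16)
    # XOR leaves a 1 bit wherever the two hashes differ.
    return bin(a ^ b).count("1")


# Made-up 16-bit values for illustration; a real PDQ hash has 256 bits.
print(hamming_distance("ff00", "0f00"))
```

A smaller distance means more similar images; where to put the match threshold is a tuning decision that this sketch leaves open.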

lfcnassif avatar Apr 26 '25 18:04 lfcnassif

Hi @wladimirleite!

I experimented with the https://github.com/christiansafka/img2vec/ library to try to tackle this, using both its smaller and its bigger EfficientNet models. Unfortunately, it's giving worse results than your original image similarity algorithm. When you have some time, could you take a look at commit b863c88 to see if I forgot to adjust something obvious? For testing, I tried to make the 4-channel color filter and the evalCut no-ops, and adjusted the distance computation to take all features into account. I'm not sure whether the re-scaling of the float features to the [-128, 127] byte range at ImageSimilarityTask line 127 could be an issue...

PS: Sorry for the debugging code mixed in, I can remove it if you want. It is also very slow for now, since I want to see if the POC works before focusing on optimizations or stability.

lfcnassif avatar Aug 18 '25 23:08 lfcnassif

I took a quick look, and the code seems fine. I will take a closer look, and maybe try to run it, to see if I find anything suspicious. About the line you mentioned, what is the useful range of "f" variable?

wladimirleite avatar Aug 19 '25 00:08 wladimirleite

Thank you @wladimirleite!

> About the line you mentioned, what is the useful range of "f" variable?

Not sure. It changes depending on the selected model and data set, so I just tried to make it use as many values of the [-128, 127] byte range as possible. I tried some models, and I'll adjust it to efficientnet_b7 and push.
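
To make the re-scaling question concrete, here is a minimal sketch of one way to map float features onto the full [-128, 127] byte range with min-max scaling. This is an assumption about the general approach, not the code from commit b863c88 or ImageSimilarityTask; the function name and sample values are hypothetical.

```python
def quantize_to_bytes(features):
    """Min-max scale a list of floats onto the signed byte range [-128, 127]."""
    lo, hi = min(features), max(features)
    if hi == lo:
        # Constant vector: no range to spread, so map everything to 0.
        return [0] * len(features)
    scale = 255.0 / (hi - lo)
    # Shift to [0, 255], round to the nearest integer, then move to [-128, 127].
    return [int(round((f - lo) * scale)) - 128 for f in features]


print(quantize_to_bytes([-1.0, 0.0, 3.0]))  # [-128, -64, 127]
```

One caveat with this sketch: if `lo` and `hi` are taken per vector, as here, two vectors with different value ranges are stretched differently, which can distort distances between them. Using one fixed range for all vectors is an alternative, at the cost of clipping or wasting part of the byte range.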

PS: If you try to run it, you must pip install img2vec_pytorch pillow numpy==1.26.4 into IPED's python.

lfcnassif avatar Aug 19 '25 00:08 lfcnassif

Just saw that the library example uses cosine similarity, I'll give it a try.
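
For reference, cosine similarity compares the direction of two feature vectors and ignores their magnitude. A minimal pure-Python sketch follows; the function name and the zero-vector handling are choices made for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; return 0.0 by convention here.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Unlike a distance, a higher value here means more similar, so a ranking based on it has to sort in descending order (or use 1 minus the similarity as a distance).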

lfcnassif avatar Aug 19 '25 03:08 lfcnassif