Migrate PhotoDNA lookup structure from VPTrees to a better algorithm
VPTrees require the distance function to be a metric, but the distance function we use (squared Euclidean distance) is not one (it does not satisfy the triangle inequality); there is a warning in PhotoDNALookupTask about that. It usually works, but sometimes a needed sub-tree is not inspected, possibly missing matches. I tried using the plain Euclidean distance in the past, but that causes most sub-trees to be inspected, resulting in very long search times.
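To illustrate the point above, here is a minimal demonstration (my own example, with values chosen only for illustration) that the squared Euclidean distance violates the triangle inequality, which is the property VPTree pruning relies on:

```java
// Shows that squared Euclidean distance is not a metric: for collinear
// points a, b, c we get d(a,c) > d(a,b) + d(b,c), breaking the triangle
// inequality that VPTree sub-tree pruning depends on.
public class SquaredDistanceDemo {

    static double squared(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {0}, b = {2}, c = {4};
        double dab = squared(a, b); // 4
        double dbc = squared(b, c); // 4
        double dac = squared(a, c); // 16
        // A metric would require d(a,c) <= d(a,b) + d(b,c); here 16 > 8.
        System.out.println("triangle inequality violated: " + (dac > dab + dbc));
    }
}
```

Taking the square root restores the metric property, but, as noted, that makes the pruning far less effective in practice.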
This comprehensive benchmarking project suggests HNSW is one of the best-performing algorithms: https://github.com/erikbern/ann-benchmarks
I found https://github.com/jelmerk/hnswlib months ago, but a license file was only clearly added today. It is Apache licensed :)
Just to note: I tried some Java LSH implementations in the past but wasn't able to get them to work, although LSH also seems like a good algorithm.
Before diving into work on #2565, I'm exploring alternatives for this issue.
Currently, our internal database contains approximately 10 million PhotoDNA hashes. These take about 2 seconds to load from the cache file stored on a fast SSD, plus an additional 50 seconds to build the VPTree structures on my workstation (which has 48 threads). My initial thought was to also load PhotoDNA hashes from known video frames, which would add roughly 5 million more hashes.
I'm concerned about the potential increase in loading time and memory footprint. While the current figures seem acceptable, they will undoubtedly grow. A key limitation of the VPTrees library is the lack of save/load functionality, meaning the structure has to be rebuilt every time.
I ran a standalone test using hnswlib, a library @lfcnassif mentioned years ago in this thread. It appears to be a robust and actively maintained library. However, it took around 12 minutes to build its index for 10 million hashes. Although it offers out-of-the-box save/load functions, these also seem somewhat slow, taking about 8 minutes to save (resulting in a ~2.5 GB file) and 7 minutes to load. While there are various parameters that might speed things up, I've already tested several configurations, and the results I've shared are from the most optimized setup.
What do you think about our current usage, with a distance function that is not a metric, violating a VPTree requirement? In the past, I discussed this with a colleague, and he said it behaves like a metric space most of the time... But I don't know what our current mean recall rate is...
I did some research about this. Using the squared distance may be a problem if there is some pruning that relies on distance sums. In our use case, finding the exact nearest item is not that important, as long as the distance to the item found is close to the actual minimum. Also, for far-away items, the exact distance and the exact nearest neighbor don't matter, as there won't be a hit anyway.
I am not sure how VPTrees work internally, but I am running comparison tests against the ground truth (it is possible to brute force the exact nearest hash and the minimal distance, as it takes only ~100 ms, using 48 threads, to check an item against all 10 M PhotoDNA hashes). So we can get a better idea of whether we are missing anything with VPTrees, and can evaluate alternative solutions in terms of search result quality.
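The brute-force ground truth can be sketched roughly like this (a simplified illustration, not the actual IPED code; names and the hash layout are assumptions):

```java
import java.util.stream.IntStream;

// Sketch of an exact (brute-force) nearest-neighbor search over all hashes,
// parallelized across available threads. Useful as ground truth when
// measuring the quality of approximate structures like VPTrees.
public class BruteForceSearch {

    // hashes[i] holds one PhotoDNA hash: 144 unsigned byte values stored as ints.
    static int squaredDistance(int[] a, int[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            int d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the index of the hash with minimal squared distance to the query,
    // scanning the whole database in parallel.
    static int nearest(int[][] hashes, int[] query) {
        return IntStream.range(0, hashes.length).parallel()
                .boxed()
                .min((i, j) -> Integer.compare(
                        squaredDistance(hashes[i], query),
                        squaredDistance(hashes[j], query)))
                .orElse(-1);
    }
}
```

With a flat in-memory array, a full scan like this is memory-bandwidth bound, which is why it stays around ~100 ms even for 10 M hashes on many threads.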
PS: Fixed the times above, as I was previously running with 1 M PhotoDNA hashes (10% of the full data set).
Just tested with 1,000 images from a real case (~400 hits). For the hits, the distances returned by VPTree matched exactly the actual minimal distance except for 2 items. Actual minimums were 13901 and 13675 and the current implementation returned 14953 and 15049, respectively. No hits were missed. So the actual search quality seems pretty good, although this was a small test sample.
That's great to know, thank you @wladimirleite for all your tests! I think I did a similar test many years ago when I first implemented the search using VPTree, but with a tiny data set...
I ran another test with a larger sample of 10,000 images, including 1,784 expected hits (i.e. squared distance < 50,000). Of those 1,784 expected hits, the current implementation (VPTree) returned 31 slightly different distance values (still within the hit range), and there were 13 cases in which hits were missed (i.e. the actual distance was less than 50,000, but the correct nearest neighbor was not found). The "worst miss" had an actual squared distance of 19,175. So my updated conclusion is that we miss a small percentage of the hits, but overall the current implementation's results are very good.
After "fixing" the distance function (applying a SQRT), all results matched the brute-force approach, but the search became way too slow. That happens because the current code aggressively (and, strictly speaking, incorrectly) prunes a lot of branches, which in rare cases discards the branch containing the nearest neighbor.
I implemented a simple vantage-point tree and got behavior similar to the library we currently use. The problem is that with many dimensions (144), there are too many ways for points to be far from each other, so the geometric pruning behind VPTrees isn't very effective and too many nodes need to be compared. On that 10,000-image sample, 41% of the nodes had to be compared against the evaluated image (using the "correct" distance function), which is way too high.
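For reference, the geometric pruning I mean can be sketched with a minimal vantage-point tree (a hypothetical toy version written for this discussion, not the library's actual implementation), using the true Euclidean metric so the pruning bound is valid:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal vantage-point tree. A sub-tree may be skipped only because the
// triangle inequality bounds how close its points can be to the query;
// with squared Euclidean distance that bound is invalid.
public class SimpleVpTree {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s); // true Euclidean metric
    }

    final double[] vantage;    // vantage point stored at this node
    final double mu;           // median distance from vantage to points below
    final SimpleVpTree inner;  // points with dist(vantage, p) <  mu
    final SimpleVpTree outer;  // points with dist(vantage, p) >= mu

    SimpleVpTree(double[][] pts) { this(new ArrayList<>(Arrays.asList(pts))); }

    SimpleVpTree(List<double[]> pts) {
        vantage = pts.get(0);
        List<double[]> rest = pts.subList(1, pts.size());
        if (rest.isEmpty()) { mu = 0; inner = null; outer = null; return; }
        double[] d = rest.stream().mapToDouble(p -> dist(vantage, p)).sorted().toArray();
        mu = d[d.length / 2];
        List<double[]> in = new ArrayList<>(), out = new ArrayList<>();
        for (double[] p : rest) (dist(vantage, p) < mu ? in : out).add(p);
        inner = in.isEmpty() ? null : new SimpleVpTree(in);
        outer = out.isEmpty() ? null : new SimpleVpTree(out);
    }

    // Returns the smallest distance from query q to any point in the tree.
    double nearest(double[] q, double best) {
        double d = dist(q, vantage);
        best = Math.min(best, d);
        // Triangle inequality bounds which side can still contain a closer point:
        if (inner != null && d - best < mu) best = inner.nearest(q, best);
        if (outer != null && d + best >= mu) best = outer.nearest(q, best);
        return best;
    }
}
```

In high dimensions, d is almost always close to mu for most nodes, so both pruning tests pass and both children get visited, which matches the ~41% scan rate observed above.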
Thanks @wladimirleite for all your tests! So it seems we have <= 1% false negatives, which looks good to me. I think I had 1 miss out of ~200 hits in my past tests. All the other algorithms I pointed out are approximate nearest neighbor search algorithms, meaning they also have some false negative rate.
PS: The PhotoDNA documentation recommends a 40,000 threshold for the squared Euclidean distance, but I kept 50,000, a value I had found through experimentation before knowing the official one, because it misses fewer hits.
After reviewing the PhotoDNA documentation, I believe the search performance issue for a large number of items was overlooked. The documented complexity assumes a general case where data points are clustered in the multidimensional space, which doesn't apply to the PhotoDNA hashes we generate.
As previously discussed, using the standard Euclidean distance (non-squared) results in a low pruning factor, forcing the search to scan too many items. Based on my tests with various libraries and custom implementations, the best approach is to stick with our current VPTree structure and distance formula.
However, there is an opportunity for improvement: the initial loading time, which is currently 1-2 minutes for our database. The library we use is no longer maintained, but its implementation is solid. A drawback is the lack of a serialization feature, which forces us to rebuild the entire data structure on every startup. This startup time is expected to grow as we expand our PhotoDNA database. While not critical, since the main processing can take hours, this slow initialization may be annoying.
We could reduce this time by a factor of 5 to 10 with a minor modification to the library's code. This would allow us to cache and reuse the VPTree when no database changes are detected (a pattern used elsewhere in our system).
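To make the caching idea concrete, here is a hypothetical sketch of the kind of on-disk cache such a modification could enable (the format, names, and what exactly gets persisted are my assumptions, not jvptree's API; a real version would also store the tree's split values and node order so no rebuild is needed):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical cache for the lookup structure: persists the hash data in a
// compact binary form so it can be reloaded when the database is unchanged,
// instead of rebuilding the whole structure on every startup.
public class VpTreeCache {

    static void save(File f, int[][] hashes) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(hashes.length);
            out.writeInt(hashes.length == 0 ? 0 : hashes[0].length);
            for (int[] h : hashes)
                for (int v : h) out.writeByte(v); // PhotoDNA values fit in one byte
        }
    }

    static int[][] load(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            int n = in.readInt(), dim = in.readInt();
            int[][] hashes = new int[n][dim];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < dim; j++) hashes[i][j] = in.readUnsignedByte();
            return hashes;
        }
    }
}
```

A sequential binary read like this is basically I/O bound, which is where the 5-10x speedup over rebuilding the tree would come from.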
@lfcnassif, what are your thoughts on implementing this improvement? If you agree, what would be the best approach: adding the modified source files directly to our project or creating a custom JAR for our repository?
Hi @wladimirleite!
I've already thought about extending the jvptree library to support serialization. Although its MIT license is very permissive, I think it is better to keep things separate. I've just forked the library into our organization account: https://github.com/sepinf-inc/jvptree
After we push the needed changes, we can simply point to the desired tag or commit in our pom.xml files, and IPED will link against a snapshot version built on the fly, like we do for the telegram-decoder dependency today; no need to publish artifacts on Maven or a custom repo.
Based on @wladimirleite's tests, I'm closing this.
We can reopen this (or create a new issue) in the future, but currently there doesn't seem to be an easy out-of-the-box alternative.