PDQ: Expected hashes of regression tests differ between reference implementations
In the Java reference implementation, there is a test that asserts that the hash of the sample image "bridge-1-original.jpg" is "f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22". See: https://github.com/facebook/ThreatExchange/blob/main/pdq/java/src/test/java/hasher/pdqhashing/PDQHasherTest.java
In the C++ readme, the hash of the same image is said to be "f8f8f0cce0f4e84d0e370a22028f67f0b36e2ed596623e1d33e6339c4e9c9b22". See: https://github.com/facebook/ThreatExchange/tree/main/pdq/cpp/bin
I can't find a mention in the PHP code that specifically tests this file, but the hash "f8f8f0cce0f4e84d0e370a22028f67f0b36e2ed596623e1d33e6339c4e9c9b22" (same as for C++) is used as an example here: https://github.com/facebook/ThreatExchange/blob/main/pdq/php/pdqhashtest.php
In the Python code (is this an official reference implementation?), the expected hash appears to be yet another value: "d8f8f0cce0f4a84f0e370a22028f67f0b36e2ed596623e1d33e6b39c4e9c9b22". See: https://github.com/facebook/ThreatExchange/blob/main/pdq/python/pdqhashing/tests/pdq_test.py
In the SQL examples, the hash "f8f8f0cce0f4e84d0e370a22028f67f0b36e2ed596623e1d33e6339c4e9c9b22" (same as C++) is used as an example. See: https://github.com/facebook/ThreatExchange/tree/main/pdq/sql
In short: the Java and Python implementations seem to differ from C++ in their expected hash values for this image. This seems problematic. Why are they not the same? Are there bugs in the reference implementations? Am I missing something?
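For reference, here is a quick sketch (plain Python, using only the hex strings quoted above) to quantify how far apart these expected hashes actually are, measured as Hamming distance over the 256-bit PDQ hashes:

```python
# Hamming distances between the expected hashes quoted above.
# The labels are mine; the hex strings come from the linked tests/readmes.
from itertools import combinations

EXPECTED = {
    "java":   "f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22",
    "cpp":    "f8f8f0cce0f4e84d0e370a22028f67f0b36e2ed596623e1d33e6339c4e9c9b22",
    "python": "d8f8f0cce0f4a84f0e370a22028f67f0b36e2ed596623e1d33e6b39c4e9c9b22",
}

def hamming(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex strings."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

for (name_a, h_a), (name_b, h_b) in combinations(EXPECTED.items(), 2):
    print(f"{name_a} vs {name_b}: {hamming(h_a, h_b)} bits")
```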
Maybe reading the documentation would help you: https://raw.githubusercontent.com/facebook/ThreatExchange/main/hashing/hashing.pdf
For this reason we have chosen a tradeoff: (a) Use implementation-specific downsampler to get to 512x512; (b) use the two-pass tent filter to go from 512x512 to 64x64. This is a balance between efficient computation, and a few bits (0/2/4/6) of hamming distance between different implementations given the same image.
The downsizing of the image in different implementations leads to slightly different hashes...
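For concreteness, here is a generic sketch of a separable (two-pass) tent-filter downsample from 512x512 to 64x64, as described in the documentation. It only illustrates the idea; the kernel width, phase, and boundary handling are my assumptions and are not claimed to match the reference implementations bit-for-bit.

```python
import numpy as np

def tent_downsample(img: np.ndarray, out_size: int = 64) -> np.ndarray:
    """Generic two-pass (separable) tent-filter downsample of a square greyscale image.

    Illustrative only: the PDQ reference code may differ in kernel width,
    phase, and boundary handling.
    """
    d = img.shape[0] // out_size                 # e.g. 512 // 64 == 8
    # Normalized triangle ("tent") kernel of width 2*d - 1.
    k = np.concatenate([np.arange(1, d + 1), np.arange(d - 1, 0, -1)]).astype(float)
    k /= k.sum()

    def one_pass(a: np.ndarray) -> np.ndarray:
        # Filter each row with the tent kernel, then keep every d-th column.
        filtered = np.array([np.convolve(row, k, mode="same") for row in a])
        return filtered[:, d // 2::d]

    # First pass over rows, second pass over columns (via transposes).
    return one_pass(one_pass(img.astype(float)).T).T
```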
Thanks for the reference!
After examining the C++/Java/Python implementations a bit more closely, it seems that:
- C++ uses CImg, which in turn defaults to nearest neighbor.
- The Java implementation does not have any logic for downsampling before the hash computation is done, so hashing is done on the full resolution image.
- Python uses PIL, which in turn defaults to bicubic.
So if I want some sample image along with an expected hash as a regression test for a new implementation, it seems I should use the hash from the Java implementation to verify that the algorithm is implemented correctly.
I could not find any guideline in the documentation about what type of downsampling to use. I suspect this might affect precision. Any insight to share about this?
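For what it's worth, here is a quick sketch (assuming Pillow and numpy are installed) showing that the choice of resampling filter alone already changes the 512x512 pixel data that reaches the hasher:

```python
# Resize the sample image to 512x512 with two different resampling filters
# and compare the resulting pixel data. The file name is the sample image
# discussed above.
import numpy as np
from PIL import Image

img = Image.open("bridge-1-original.jpg").convert("L")

nearest = np.asarray(img.resize((512, 512), Image.NEAREST), dtype=np.int16)
bicubic = np.asarray(img.resize((512, 512), Image.BICUBIC), dtype=np.int16)

diff = np.abs(nearest - bicubic)
print("pixels that differ:", int((diff > 0).sum()), "of", diff.size)
print("max absolute difference:", int(diff.max()))
```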
@BarrettOlson was looking into exactly the same problem for cross-language verification. The downsampling problem certainly makes it a pain to verify correctness. Barrett, did you have any luck with pre-downsampled inputs, or with intercepting the bits post-resizing?
Yes, I believe downsampling is creating these differences. The original authors' intent was to have users replace CImg.h with their own (potentially proprietary and more efficient) image-processing library, with the expectation that a difference of a few bits (0/2/4/6) is acceptable, given that the common distance threshold is closer to 31.
I am looking into better (more exact) verification options, such as an intermediate representation or images already sized 512x512 as input, but I don't have much to add to this thread yet. Also open to suggestions! :)
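For the 512x512-input option, a minimal sketch of how such a shared test input could be produced (file names are only examples; assumes Pillow):

```python
from PIL import Image

# Resize the sample image to 512x512 once, with a fixed filter, and save it
# losslessly so every implementation decodes identical pixels.
src = Image.open("bridge-1-original.jpg").convert("L")
canonical = src.resize((512, 512), Image.BICUBIC)
canonical.save("bridge-1-512x512.png")  # PNG keeps the pixel data lossless
```

With an input like that, any remaining hash differences should come from the hashing code itself rather than from the implementation-specific downsampler (assuming the implementations skip downsampling for images that are already 512x512 or smaller).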
Thanks for the info!
My personal suggestion would be for the hashing library to ignore resolution and downsampling entirely and leave any preprocessing to the user of the library. That would make the hasher much easier to test as a black box, and the user could choose to compute the hash on the full-resolution image if that matters for their use case.
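A rough sketch of what I mean, with `hash_fn` standing in for whatever PDQ implementation is under test (the function and parameter names here are hypothetical, not an existing API):

```python
from typing import Callable, Optional
from PIL import Image

def hash_with_explicit_preprocessing(path: str,
                                     hash_fn: Callable[[Image.Image], str],
                                     size: Optional[int] = 512) -> str:
    """Caller-owned preprocessing: the hash function only sees the exact
    pixels it is given and never resizes anything itself."""
    img = Image.open(path).convert("L")
    if size is not None:
        # Downsampling is an explicit caller decision, not hidden in the library.
        img = img.resize((size, size), Image.BICUBIC)
    return hash_fn(img)
```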
Having regression tests on images below 512x512 px seems like it would solve the cross-language verification issue for now. But if the downsampling logic stays in the hashing code, those tests will be sensitive to future changes to the 512 px limit.
I'll use the hash from the Java tests, since it does not do any downsampling from what I can see.