[pdq] optimize PDQ linear scan implementation
Closes #1810
Summary
This is a breaking API change and needs a mention in the release notes, as the alignment of Hash256 is now 64-bit. It should only affect users who reuse old artifacts (static/dynamic libs) while building new code against the new header; no downstream code change should be needed.
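To illustrate why this is ABI-breaking, here is a hypothetical sketch (not the actual header; the field layout and alignment value are assumptions for illustration):

```cpp
#include <cstdint>

// Hypothetical illustration of the ABI concern, not the real pdq header:
// raising the type's alignment changes the layout contract, so an old
// compiled library and code built against the new header can disagree
// about where a Hash256 may legally live in memory.
struct alignas(alignof(uint64_t)) Hash256Like {
  uint16_t w[16]; // 256 bits of hash payload
};
static_assert(alignof(Hash256Like) == 8, "new 64-bit alignment contract");
```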
I also refactored the pdq benchmark data structure so it no longer interleaves strings with the actual hash data; the interleaving is bad for caching and increases TLB pressure, which hurts both the existing MIH implementation and the linear scan. (I also fixed a logic bug that could cause arithmetic wraparound/infinite looping in the benchmark execution; it is not relevant to the results.)
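For reference, a minimal sketch of the layout change (all names are illustrative, not the PR's actual types):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Before: labels interleaved with hash words, so a linear scan drags
// string bytes through the cache and touches more pages (TLB pressure).
struct InterleavedEntry {
  std::string label;
  uint64_t hash[4]; // 256-bit PDQ hash
};

// After: hashes packed contiguously so the scan streams pure hash data;
// labels live in a parallel array and are only touched on a reported match.
struct DeinterleavedIndex {
  std::vector<uint64_t> hashes;    // 4 qwords per entry, densely packed
  std::vector<std::string> labels; // labels[i] goes with hashes[4*i..4*i+3]
};
```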
I can probably shave off some constant factor in MIH too, but that won't change the big picture, so I will defer it to a later PR if needed.
Performance as of now:
> CFLAGS="-march=znver4 -funroll-loops -ftree-vectorize -O3" make benchmark-query
> ./benchmark-query -b (math 10_000_000) -m $method
MIH index build time: 74.216144
METHOD: mutually-indexed hashing query
QUERY COUNT: 1000 * 8 = 8000
INDEX COUNT: 10000000
TOTAL MATCH COUNT: 8000
TOTAL QUERY SECONDS: 59.697593
QUERIES PER SECOND: 16.75
THROUGHPUT (millions of amortized tests/sec): 1340.09
METHOD: linear query (SIMD accelerated)
QUERY COUNT: 1000 * 8 = 8000
INDEX COUNT: 10000000
TOTAL MATCH COUNT: 8000
TOTAL QUERY SECONDS: 14.524117
QUERIES PER SECOND: 68.85
THROUGHPUT (millions of amortized tests/sec): 5508.08
Assembly for queryAll hot path
151d0: 62 71 fd 48 6f cf vmovdqa64 zmm9,zmm7
151d6: 62 71 fd 48 6f de vmovdqa64 zmm11,zmm6
151dc: 62 71 fd 48 6f ed vmovdqa64 zmm13,zmm5
151e2: 62 71 fd 48 6f fc vmovdqa64 zmm15,zmm4
151e8: 62 53 b5 58 25 0f 99 vpternlogq zmm9,zmm9,QWORD BCST [r15],0x99
151ef: 62 53 a5 58 25 5f 01 vpternlogq zmm11,zmm11,QWORD BCST [r15+0x8],0x99
151f6: 99
151f7: 62 52 fd 48 55 d1 vpopcntq zmm10,zmm9
151fd: 62 53 95 58 25 6f 02 vpternlogq zmm13,zmm13,QWORD BCST [r15+0x10],0x99
15204: 99
15205: 62 53 85 58 25 7f 03 vpternlogq zmm15,zmm15,QWORD BCST [r15+0x18],0x99
1520c: 99
1520d: 62 52 fd 48 55 e3 vpopcntq zmm12,zmm11
15213: 62 52 fd 48 55 f5 vpopcntq zmm14,zmm13
15219: 62 d1 ad 48 d4 c4 vpaddq zmm0,zmm10,zmm12
1521f: 62 d2 fd 48 55 d7 vpopcntq zmm2,zmm15
15225: 62 d1 fd 48 d4 ce vpaddq zmm1,zmm0,zmm14
1522b: 62 f1 f5 48 d4 da vpaddq zmm3,zmm1,zmm2
15231: 62 d3 e5 48 1f c0 05 vpcmpnltq k0,zmm3,zmm8
15238: c5 f9 98 c0 kortestb k0,k0
1523c: 0f 85 7d 01 00 00 jne 153bf <_ZL5queryPciPS_+0x165f>
15242: 48 ff c2 inc rdx
15245: 49 83 c7 20 add r15,0x20
15249: 4c 39 ca cmp rdx,r9
1524c: 72 82 jb 151d0 <_ZL5queryPciPS_+0x1470>
"Barcelona" is AMD K10, a 2007 CPU with almost no SIMD support, it is placed here for comparison for difference achievable only using scalar optimizations (alignment, scalar POPCNT r/64 encouragement that at least LLVM seems to fail to do automatically, etc).
> CFLAGS="-march-barcelona -funroll-loops -ftree-vectorize -O3" make benchmark-query
> ./benchmark-query -b (math 10_000_000) -m $method
MIH index build time: 63.761978
METHOD: mutually-indexed hashing query
QUERY COUNT: 1000 * 8 = 8000
INDEX COUNT: 10000000
TOTAL MATCH COUNT: 8000
TOTAL QUERY SECONDS: 58.247728
QUERIES PER SECOND: 17.17
THROUGHPUT (millions of amortized tests/sec): 1373.44
METHOD: linear query
QUERY COUNT: 1000 * 8 = 8000
INDEX COUNT: 10000000
TOTAL MATCH COUNT: 8000
TOTAL QUERY SECONDS: 62.094976
QUERIES PER SECOND: 16.10
THROUGHPUT (millions of amortized tests/sec): 1288.35
Test Plan
The matching code is mirrored from my Rust implementation with no logical changes; that implementation has robust unit and edge-case testing, plus end-to-end image-matching tests. The benchmark results here are also comparable to my statistics-based Criterion benchmark of the Rust implementation.
Unit tests and Rust implementation: https://github.com/eternal-flame-AD/yume-pdq/blob/main/src/matching/mod.rs
End-to-end test using a synthetic DB: https://github.com/eternal-flame-AD/yume-pdq/blob/main/examples/end_to_end.rs
A comprehensive unit test is written in C++ as well, using a similar methodology (exhaustively checking every possible combination of needle/haystack index). Test results:
Using scalar flat index / Using SIMD accelerated flat index
Testing maxDistance = 0
PASS: initially no hits
PASS: 1 hit
PASS: 2 hits
PASS: no edge-case false positives
PASS: no hits at the end
Testing maxDistance = 31
PASS: initially no hits
PASS: 1 hit
PASS: 2 hits
PASS: no edge-case false positives
PASS: no hits at the end
Testing maxDistance = 64
PASS: initially no hits
PASS: 1 hit
PASS: 2 hits
PASS: no edge-case false positives
PASS: no hits at the end
Testing correctness under misuse
PASS: correctness under misuse
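A minimal sketch of that exhaustive methodology (illustrative code, not the PR's actual test; the hash type and helper names are assumptions):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Hash = std::array<uint64_t, 4>; // 256-bit hash, illustrative type

static int hamming(const Hash& a, const Hash& b) {
  int d = 0;
  for (int i = 0; i < 4; ++i) d += __builtin_popcountll(a[i] ^ b[i]);
  return d;
}

// Plant the needle at every haystack index in turn and assert the scan
// reports a hit at exactly that index and nowhere else. Valid for
// maxDistance in [0, 255], since filler/needle are 256 bits apart.
void exhaustiveNeedleTest(int maxDistance) {
  const size_t n = 256;
  Hash filler{};       // all-zero hash
  Hash needle;
  needle.fill(~0ULL);  // all-ones: distance 256 from filler
  std::vector<Hash> haystack(n, filler);
  for (size_t pos = 0; pos < n; ++pos) {
    haystack[pos] = needle;
    for (size_t i = 0; i < n; ++i) {
      bool hit = hamming(haystack[i], needle) <= maxDistance;
      assert(hit == (i == pos)); // exactly one hit, at the planted index
    }
    haystack[pos] = filler;
  }
}
```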
Since the MIH index used int for specifying the threshold, I followed suit, but I added a test to make sure that even under nonsensical input (outside 0-255) the answer is still technically correct.
Load/store alignment is fully tested using the RFLAGS AC bit in both the Rust and C++ implementations.
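For context, here is a minimal sketch of how the AC bit can be toggled on x86-64 Linux with GCC/Clang inline assembly (an assumption about the harness, not the PR's actual test code). With the bit set, any unaligned scalar data access raises SIGBUS, so a clean run proves natural alignment:

```cpp
#include <cstdint>

// Toggle the RFLAGS Alignment Check bit (bit 18). On x86-64 Linux the
// kernel keeps CR0.AM set, so with AC enabled an unaligned user-mode
// data access raises #AC, delivered to the process as SIGBUS.
static inline void setAlignmentCheck(bool enable) {
  uint64_t rflags;
  asm volatile("pushfq\n\tpopq %0" : "=r"(rflags));
  if (enable)
    rflags |= (1ULL << 18);
  else
    rflags &= ~(1ULL << 18);
  asm volatile("pushq %0\n\tpopfq" : : "r"(rflags) : "cc");
}
```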
We won't be able to test the AVX512 kernel on GitHub Actions: AFAIK the runners do not have AVX512, and QEMU does not support it either, so for now we have to test locally.
As a disclosure, you might notice that the performance of MIH increases dramatically at really big query counts; this is an artifact of hashmap micro-architectural behavior and is not achievable in production environments IMO.
The benchmark is inherently O(n) with respect to query count, so this speedup is almost certainly branch prediction or caching (the cache learns where the artificially introduced matches sit in the indices, boosting the hit rate). In a noisy, multi-threaded environment with a lot of irrelevant code in between, a CPU core has very little opportunity to learn complex data-access behavior that cannot be amortized over 10k queries under ideal back-to-back query conditions.
I think this is ready for review.
The below is not reproducible without some sort of LTO pass over the code; on Godbolt the ASM is very clean.
I am not too familiar with C++ codegen optimization. I tried writing intrinsics that correspond exactly to the ASM I want, but GCC still rewrote it, so it isn't some syntax/aliasing issue; the optimizer is actively deciding that its version is better. I don't think I can fight that except by writing asm directly.
I asked an AI; it says GCC's heuristics prefer to do a move first and then the ternary-logic op, whereas LLVM prefers to do the broadcast and then XOR directly with the broadcast operand, as I prescribed. The speedup is there and the answer is correct, so I guess it doesn't really hurt.
Please copy any modified pdq files to vpdq/cpp/pdq/cpp as well (this sucks...)
Done; I copied the core files there, which should have 100% the same behavior, but I am not familiar with vpdq so I am not sure how to wire it up.
I think that is best handled in another PR if needed, as many more things would need to be validated and the diff is already not small.
No need to copy the new index/flat.h file to vpdq. Actually, the whole index folder could be removed from vpdq because it's not used, but you don't need to do that in this PR. CI runs tests for vpdq if there are changes to pdq, so no need to worry about testing anything manually there.
@ianwal I think this should fully address your concerns about intrinsics; I did more experiments to confirm that both the data layout and the intrinsics are necessary for efficient code generation.
Since this was brought up, I will share some history of the development and how I arrived at this instruction sequence.
If I recall correctly, my initial solution was pretty similar to Faiss (a straightforward vpxor[d|q], then vpopcnt[d|q], then a horizontal add); when I benchmarked it, it was not faster than a properly aligned scalar solution.
That is intuitive too: there are 4 qwords in a hash, so even if the POPCNT were free, at least 2 permutations and then 2 additions (with at least one dependency chain) would be required to reduce each POPCNT result into a single distance metric, one that cheaper scalar sequences could obtain anyway. To get further performance we need to spend on average <3 cycles per hash, so that obviously won't cut it.
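A hedged sketch of that baseline (illustrative, not Faiss's or this PR's actual code; requires AVX512VL + AVX512VPOPCNTDQ for the 256-bit popcount):

```cpp
#include <immintrin.h>
#include <cstdint>

// One query vs. one database hash: XOR, per-qword popcount, then a
// horizontal reduction paid for every single hash pair.
static inline int hammingNaive(const uint64_t* a, const uint64_t* b) {
  __m256i x = _mm256_xor_si256(
      _mm256_loadu_si256((const __m256i*)a),
      _mm256_loadu_si256((const __m256i*)b));
  __m256i c = _mm256_popcnt_epi64(x); // 4 qword popcounts
  // The reduction below is the ILP-hostile part: an extract/shuffle chain
  // with serial dependencies, just to collapse 4 lanes into one scalar.
  __m128i s = _mm_add_epi64(_mm256_castsi256_si128(c),
                            _mm256_extracti128_si256(c, 1));
  s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
  return (int)_mm_cvtsi128_si64(s);
}
```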
So I realized that, with the natural dihedral batching, we can do a one-time transpose (AoS -> SoA) that replaces all the expensive, ILP-hostile horizontal operations with broadcasts, vertical additions, and vectorized compares. Indeed, with no unaligned loads and no horizontal operations we can process 8 queries at once with minimal overhead, which is how we get this throughput.
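A hedged sketch of that inner step (illustrative; per the disassembly above, the real code appears to fuse the broadcast and XOR into vpternlogq XNORs and compare against a flipped threshold, but the dataflow is the same; test8 and the parameter layout are assumptions):

```cpp
#include <immintrin.h>
#include <cstdint>

// q0..q3 hold qword 0..3 of 8 pre-transposed queries, one 512-bit
// register per qword lane. Requires AVX512F + AVX512VPOPCNTDQ.
static inline __mmask8 test8(
    __m512i q0, __m512i q1, __m512i q2, __m512i q3,
    const uint64_t* db,    // one 256-bit database hash, 4 qwords
    __m512i threshold) {   // max Hamming distance, broadcast to all lanes
  // Broadcast each database qword and XOR vertically against all 8 queries.
  __m512i x0 = _mm512_xor_epi64(q0, _mm512_set1_epi64((long long)db[0]));
  __m512i x1 = _mm512_xor_epi64(q1, _mm512_set1_epi64((long long)db[1]));
  __m512i x2 = _mm512_xor_epi64(q2, _mm512_set1_epi64((long long)db[2]));
  __m512i x3 = _mm512_xor_epi64(q3, _mm512_set1_epi64((long long)db[3]));
  // Vertical popcounts and adds: lane i ends up holding the full 256-bit
  // Hamming distance between query i and this hash. No cross-lane shuffles.
  __m512i d = _mm512_add_epi64(
      _mm512_add_epi64(_mm512_popcnt_epi64(x0), _mm512_popcnt_epi64(x1)),
      _mm512_add_epi64(_mm512_popcnt_epi64(x2), _mm512_popcnt_epi64(x3)));
  // One vectorized compare yields a hit mask for all 8 queries at once.
  return _mm512_cmple_epu64_mask(d, threshold);
}
```

In this sketch, each database hash costs four broadcast loads, four XOR+POPCNT pairs, three vertical adds, and one mask compare, amortized across 8 queries.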
I gave my best attempt at emulating this with pure autovectorized code (which isn't in any way more readable IMO); it still resulted in compiler-inserted permutations that are almost guaranteed to be much slower, and GCC 15 completely refused to vectorize the code for some reason. https://godbolt.org/z/9cqfjM8cE
"Emulated AVX512":
Intrinsics:
Sorry, I'm way underwater. Thanks @ianwal for the help here!
> This is a breaking API change and needs a mention in the release notes
We aren't maintaining release notes at the moment, but I can start a document. The biggest source of downstream usage that I know of is https://github.com/faustomorales/pdqhash-python (pdq bindings).
Huge, huge thanks for the high-quality PR here with unit tests. I haven't coded C++ in like a decade, but I'll do my best to look.
It seems like clang-format isn't picking up the .clang-format file, because I did run make format before committing. I looked at the CI logs and ran that container; if this doesn't fix the formatting complaint, I probably don't know how to fix it.
Thanks for the approval. Yes, I am not entirely sure either whether it is worth incorporating this into the Python threatexchange: managing the data structure for live mutations is possible, but it requires some high-level wrapping, like paging and atomic operations, that is difficult to do from Python, and platforms that need to squeeze out this kind of performance/consistency (60ms vs. 15ms) probably already have a C++/Rust codebase with their own preferred way to do it.
Do you need to approve the workflow run, @Dcallies? Let's double-check for any new compiler warnings that show up there before merging.
Yeah, it needs me to manually run the CI for non-contributors (security step).
Looks like we are aligned on landing, @eternal-flame-AD , we just need to get the CI green again and we can merge!
I fixed the CMake complaint, but it seems there is some weirdness going on with the Makefile warning flags: it first sets a WFLAGS that is different for each compiler, then resets it to another constant value?
https://github.com/facebook/ThreatExchange/blob/5eae856ab2e8ed6984469e97beec6958c463507d/pdq/cpp/Makefile#L2-L17
I tried make wasm and it doesn't seem to work for me, but that is unrelated to this PR (we probably shouldn't set -Werror as an un-overridable default?):
> make wasm
emcc -g -std=c++11 -D BUILD_WASM -s ERROR_ON_UNDEFINED_SYMBOLS=0 -s EXPORTED_RUNTIME_METHODS=['ccall','UTF8ToString '] -s EXPORTED_FUNCTIONS=['_getHash'] -s FORCE_FILESYSTEM=1 -s LLD_REPORT_UNDEFINED -s ERROR_ON_UNDEFINED_SYMBOLS=0 -s DISABLE_EXCEPTION_CATCHING=0 -s ALLOW_MEMORY_GROWTH=1 -I../.. -I. -Wall -Wno-char-subscripts -Wno-stringop-truncation -Wno-class-memaccess -Wno-misleading-indentation -Wno-unknown-warning-option -Werror -O3 bin/pdq-photo-hasher.cpp common/pdqhashtypes.cpp io/hashio.cpp common/pdqhamming.cpp hashing/pdqhashing.cpp downscaling/downscaling.cpp io/pdqio.cpp hashing/torben.cpp common/pdqutils.cpp -o pdq-photo-hasher.js -lm -lpthread
In file included from bin/pdq-photo-hasher.cpp:7:
In file included from ../../pdq/cpp/io/pdqio.h:14:
./CImg.h:81800:23: error: first argument in call to 'memcpy' is a pointer to non-trivially copyable type 'CImg<char>' [-Werror,-Wnontrivial-memcall]
81800 | std::memcpy(new_data, _data, sizeof(CImg<T>) * npos);
| ^
./CImg.h:81867:7: note: in instantiation of member function 'cimg_library::CImgList<char>::insert' requested here
81867 | insert(empty, npos + i);
| ^
./CImg.h:16345:18: note: in instantiation of member function 'cimg_library::CImgList<char>::insert' requested here
16345 | move_to(list.insert(1, npos)[npos]);
| ^
./CImg.h:87439:27: note: in instantiation of function template specialization 'cimg_library::CImg<char>::move_to<char>' requested here
87439 | full_filename.move_to(res);
| ^
./CImg.h:81800:23: note: explicitly cast the pointer to silence this warning
81800 | std::memcpy(new_data, _data, sizeof(CImg<T>) * npos);
| ^
| (void*)
./CImg.h:81803:31: error: first argument in call to 'memcpy' is a pointer to non-trivially copyable type 'CImg<char>' [-Werror,-Wnontrivial-memcall]
81803 | new_data + npos + 1,
| ^
./CImg.h:81803:31: note: explicitly cast the pointer to silence this warning
81803 | new_data + npos + 1,
| ^
| (void*)
./CImg.h:81819:21: error: first argument in call to 'memset' is a pointer to non-trivially copyable type 'CImg<char>' [-Werror,-Wnontrivial-memcall]
81819 | std::memset(_data, 0, sizeof(CImg<T>) * (_width - 1));
| ^
./CImg.h:81819:21: note: explicitly cast the pointer to silence this warning
81819 | std::memset(_data, 0, sizeof(CImg<T>) * (_width - 1));
| ^
| (void*)
<...>
@Dcallies any updates? Can you approve the CI so I know it passes?
The WASM bug seems to be unrelated to this PR.
I am not familiar with the current state of the higher-level abstractions; if we already have a C++ dependency for HMA, maybe we can introduce this in a later PR through the Python API (I haven't thought through the details, but if 100% recall is required, I believe this is currently the most competitive solution for PDQ).
Apologies, swamped on my end, I'll give notice to @faustomorales that there's an updated PDQ version.
Here to report that the Python bindings include this new version on v0.2.8 🫡 🎉 😂