[pdq] optimize PDQ linear scan implementation
Closes #1810
Summary
This is a breaking API change and needs a mention in the release notes, as the alignment of Hash256 is now 64-bit. It should only affect users who reuse old artifacts (static/dynamic libs) while building new code against the new header; no downstream code change should be needed.
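To illustrate why this is ABI-breaking, here is a hypothetical sketch (not the actual header; the field layout and alignment value are assumptions for illustration):

```cpp
#include <cstdint>

// Hypothetical illustration of the ABI concern, not the real pdq header:
// raising the type's alignment changes the layout contract, so an old
// compiled library and code built against the new header can disagree
// about where a Hash256 may legally live in memory.
struct alignas(alignof(uint64_t)) Hash256Like {
  uint16_t w[16]; // 256 bits of hash payload
};
static_assert(alignof(Hash256Like) == 8, "new 64-bit alignment contract");
```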
I also refactored the pdq benchmark data structure so it no longer interleaves strings with the actual hash data; the interleaving is bad for caching and increases TLB pressure, which hurts both the existing MIH implementation and the linear scan. (I also fixed a logic bug that could cause arithmetic wraparound/infinite looping in the benchmark execution; it is not relevant to the results.)
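For reference, a minimal sketch of the layout change (all names are illustrative, not the PR's actual types):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Before: labels interleaved with hash words, so a linear scan drags
// string bytes through the cache and touches more pages (TLB pressure).
struct InterleavedEntry {
  std::string label;
  uint64_t hash[4]; // 256-bit PDQ hash
};

// After: hashes packed contiguously so the scan streams pure hash data;
// labels live in a parallel array and are only touched on a reported match.
struct DeinterleavedIndex {
  std::vector<uint64_t> hashes;    // 4 qwords per entry, densely packed
  std::vector<std::string> labels; // labels[i] goes with hashes[4*i..4*i+3]
};
```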
I can probably shave off some constant factor in MIH too, but that won't change the big picture, so I will defer it to a later PR if needed.
Performance as of now:
> CFLAGS="-march=znver4 -funroll-loops -ftree-vectorize -O3" make benchmark-query
> ./benchmark-query -b (math 10_000_000) -m $method
MIH index build time: 74.216144
METHOD: mutually-indexed hashing query
QUERY COUNT: 1000 * 8 = 8000
INDEX COUNT: 10000000
TOTAL MATCH COUNT: 8000
TOTAL QUERY SECONDS: 59.697593
QUERIES PER SECOND: 16.75
THROUGHPUT (millions of amortized tests/sec): 1340.09
METHOD: linear query (SIMD accelerated)
QUERY COUNT: 1000 * 8 = 8000
INDEX COUNT: 10000000
TOTAL MATCH COUNT: 8000
TOTAL QUERY SECONDS: 14.524117
QUERIES PER SECOND: 68.85
THROUGHPUT (millions of amortized tests/sec): 5508.08
Assembly for queryAll hot path
151d0: 62 71 fd 48 6f cf vmovdqa64 zmm9,zmm7
151d6: 62 71 fd 48 6f de vmovdqa64 zmm11,zmm6
151dc: 62 71 fd 48 6f ed vmovdqa64 zmm13,zmm5
151e2: 62 71 fd 48 6f fc vmovdqa64 zmm15,zmm4
151e8: 62 53 b5 58 25 0f 99 vpternlogq zmm9,zmm9,QWORD BCST [r15],0x99
151ef: 62 53 a5 58 25 5f 01 vpternlogq zmm11,zmm11,QWORD BCST [r15+0x8],0x99
151f6: 99
151f7: 62 52 fd 48 55 d1 vpopcntq zmm10,zmm9
151fd: 62 53 95 58 25 6f 02 vpternlogq zmm13,zmm13,QWORD BCST [r15+0x10],0x99
15204: 99
15205: 62 53 85 58 25 7f 03 vpternlogq zmm15,zmm15,QWORD BCST [r15+0x18],0x99
1520c: 99
1520d: 62 52 fd 48 55 e3 vpopcntq zmm12,zmm11
15213: 62 52 fd 48 55 f5 vpopcntq zmm14,zmm13
15219: 62 d1 ad 48 d4 c4 vpaddq zmm0,zmm10,zmm12
1521f: 62 d2 fd 48 55 d7 vpopcntq zmm2,zmm15
15225: 62 d1 fd 48 d4 ce vpaddq zmm1,zmm0,zmm14
1522b: 62 f1 f5 48 d4 da vpaddq zmm3,zmm1,zmm2
15231: 62 d3 e5 48 1f c0 05 vpcmpnltq k0,zmm3,zmm8
15238: c5 f9 98 c0 kortestb k0,k0
1523c: 0f 85 7d 01 00 00 jne 153bf <_ZL5queryPciPS_+0x165f>
15242: 48 ff c2 inc rdx
15245: 49 83 c7 20 add r15,0x20
15249: 4c 39 ca cmp rdx,r9
1524c: 72 82 jb 151d0 <_ZL5queryPciPS_+0x1470>
"Barcelona" is AMD K10, a 2007 CPU with almost no SIMD support, it is placed here for comparison for difference achievable only using scalar optimizations (alignment, scalar POPCNT r/64 encouragement that at least LLVM seems to fail to do automatically, etc).
> CFLAGS="-march-barcelona -funroll-loops -ftree-vectorize -O3" make benchmark-query
> ./benchmark-query -b (math 10_000_000) -m $method
MIH index build time: 63.761978
METHOD: mutually-indexed hashing query
QUERY COUNT: 1000 * 8 = 8000
INDEX COUNT: 10000000
TOTAL MATCH COUNT: 8000
TOTAL QUERY SECONDS: 58.247728
QUERIES PER SECOND: 17.17
THROUGHPUT (millions of amortized tests/sec): 1373.44
METHOD: linear query
QUERY COUNT: 1000 * 8 = 8000
INDEX COUNT: 10000000
TOTAL MATCH COUNT: 8000
TOTAL QUERY SECONDS: 62.094976
QUERIES PER SECOND: 16.10
THROUGHPUT (millions of amortized tests/sec): 1288.35
Test Plan
The matching code is mirrored from my Rust implementation with no logical changes; that implementation has robust unit and edge-case testing, plus end-to-end image-matching tests. The benchmark results here are also comparable to my statistics-based Criterion benchmark of the Rust implementation.
Unit tests and Rust implementation: https://github.com/eternal-flame-AD/yume-pdq/blob/main/src/matching/mod.rs
End-to-end test using a synthetic DB: https://github.com/eternal-flame-AD/yume-pdq/blob/main/examples/end_to_end.rs
A comprehensive unit test is written in C++ as well, using a similar methodology (exhaustively checking every possible combination of needle/haystack index). Test results:
Using scalar flat index / Using SIMD accelerated flat index
Testing maxDistance = 0
PASS: initially no hits
PASS: 1 hit
PASS: 2 hits
PASS: no edge-case false positives
PASS: no hits at the end
Testing maxDistance = 31
PASS: initially no hits
PASS: 1 hit
PASS: 2 hits
PASS: no edge-case false positives
PASS: no hits at the end
Testing maxDistance = 64
PASS: initially no hits
PASS: 1 hit
PASS: 2 hits
PASS: no edge-case false positives
PASS: no hits at the end
Testing correctness under misuse
PASS: correctness under misuse
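A minimal sketch of that exhaustive methodology (illustrative code, not the PR's actual test; the hash type and helper names are assumptions):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Hash = std::array<uint64_t, 4>; // 256-bit hash, illustrative type

static int hamming(const Hash& a, const Hash& b) {
  int d = 0;
  for (int i = 0; i < 4; ++i) d += __builtin_popcountll(a[i] ^ b[i]);
  return d;
}

// Plant the needle at every haystack index in turn and assert the scan
// reports a hit at exactly that index and nowhere else. Valid for
// maxDistance in [0, 255], since filler/needle are 256 bits apart.
void exhaustiveNeedleTest(int maxDistance) {
  const size_t n = 256;
  Hash filler{};       // all-zero hash
  Hash needle;
  needle.fill(~0ULL);  // all-ones: distance 256 from filler
  std::vector<Hash> haystack(n, filler);
  for (size_t pos = 0; pos < n; ++pos) {
    haystack[pos] = needle;
    for (size_t i = 0; i < n; ++i) {
      bool hit = hamming(haystack[i], needle) <= maxDistance;
      assert(hit == (i == pos)); // exactly one hit, at the planted index
    }
    haystack[pos] = filler;
  }
}
```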
Since the MIH index used int for specifying the threshold, I followed suit, but I added a test to make sure that even under nonsensical input (outside 0-255) the answer is still technically correct.
Load/store alignment is fully tested using the RFLAGS AC bit in both the Rust and C++ implementations.
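For context, here is a minimal sketch of how the AC bit can be toggled on x86-64 Linux with GCC/Clang inline assembly (an assumption about the harness, not the PR's actual test code). With the bit set, any unaligned scalar data access raises SIGBUS, so a clean run proves natural alignment:

```cpp
#include <cstdint>

// Toggle the RFLAGS Alignment Check bit (bit 18). On x86-64 Linux the
// kernel keeps CR0.AM set, so with AC enabled an unaligned user-mode
// data access raises #AC, delivered to the process as SIGBUS.
static inline void setAlignmentCheck(bool enable) {
  uint64_t rflags;
  asm volatile("pushfq\n\tpopq %0" : "=r"(rflags));
  if (enable)
    rflags |= (1ULL << 18);
  else
    rflags &= ~(1ULL << 18);
  asm volatile("pushq %0\n\tpopfq" : : "r"(rflags) : "cc");
}
```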
We won't be able to test the AVX512 kernel on GitHub Actions: AFAIK the runners do not have AVX512, and QEMU does not support it either, so for now we have to test locally.
As a disclosure, you might notice that the performance of MIH increases dramatically at really big query counts; this is an artifact of hashmap micro-architectural behavior and is not achievable in production environments IMO.
The benchmark is inherently O(n) with respect to query count, so this speedup is almost certainly branch prediction or caching (the cache learns where the artificially introduced matches sit in the indices, boosting the hit rate). In a noisy, multi-threaded environment with a lot of irrelevant code in between, a CPU core has very little opportunity to learn complex data-access behavior that cannot be amortized over 10k queries under ideal back-to-back query conditions.
I think this is ready for review.
The below is not reproducible without some sort of LTO pass over the code; on Godbolt the ASM is very clean.
I am not too familiar with C++ codegen optimization. I tried writing intrinsics that correspond exactly to the ASM I want, but GCC still rewrote it, so it isn't some syntax/aliasing issue; the optimizer is actively deciding that its version is better. I don't think I can fight that except by writing asm directly.
I asked an AI; it says GCC's heuristics prefer to do a move first and then the ternary-logic op, whereas LLVM prefers to do the broadcast and then XOR directly with the broadcast operand, as I prescribed. The speedup is there and the answer is correct, so I guess it doesn't really hurt.
Please copy any modified pdq files to vpdq/cpp/pdq/cpp as well (this sucks...)
Done; I copied the core files there, which should have 100% the same behavior, but I am not familiar with vpdq so I am not sure how to wire it up.
I think that is best handled in another PR if needed, as many more things would need to be validated and the diff is already not small.
No need to copy the new index/flat.h file to vpdq. Actually, the whole index folder could be removed from vpdq because it's not used, but you don't need to do that in this PR. CI runs tests for vpdq if there are changes to pdq, so no need to worry about testing anything manually there.
@ianwal I think this should fully address your concerns about intrinsics; I did more experiments to confirm that both the data layout and the intrinsics are necessary for efficient code generation.
Since this was brought up, I will share some history of the development and how I arrived at this instruction sequence.
If I recall correctly, my initial solution was pretty similar to Faiss (a straightforward vpxor[d|q], then vpopcnt[d|q], then a horizontal add); when I benchmarked it, it was not faster than a properly aligned scalar solution.
That is intuitive too: there are 4 qwords in a hash, so even if the POPCNT were free, at least 2 permutations and then 2 additions (with at least one dependency chain) would be required to reduce each POPCNT result into a single distance metric, one that cheaper scalar sequences could obtain anyway. To get further performance we need to spend on average <3 cycles per hash, so that obviously won't cut it.
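A hedged sketch of that baseline (illustrative, not Faiss's or this PR's actual code; requires AVX512VL + AVX512VPOPCNTDQ for the 256-bit popcount):

```cpp
#include <immintrin.h>
#include <cstdint>

// One query vs. one database hash: XOR, per-qword popcount, then a
// horizontal reduction paid for every single hash pair.
static inline int hammingNaive(const uint64_t* a, const uint64_t* b) {
  __m256i x = _mm256_xor_si256(
      _mm256_loadu_si256((const __m256i*)a),
      _mm256_loadu_si256((const __m256i*)b));
  __m256i c = _mm256_popcnt_epi64(x); // 4 qword popcounts
  // The reduction below is the ILP-hostile part: an extract/shuffle chain
  // with serial dependencies, just to collapse 4 lanes into one scalar.
  __m128i s = _mm_add_epi64(_mm256_castsi256_si128(c),
                            _mm256_extracti128_si256(c, 1));
  s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
  return (int)_mm_cvtsi128_si64(s);
}
```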
So I realized that, with the natural dihedral batching, we can do a one-time transpose (AoS -> SoA) that replaces all the expensive, ILP-hostile horizontal operations with broadcasts, vertical additions, and vectorized compares. Indeed, with no unaligned loads and no horizontal operations we can process 8 queries at once with minimal overhead, which is how we get this throughput.
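A hedged sketch of that inner step (illustrative; per the disassembly above, the real code appears to fuse the broadcast and XOR into vpternlogq XNORs and compare against a flipped threshold, but the dataflow is the same; test8 and the parameter layout are assumptions):

```cpp
#include <immintrin.h>
#include <cstdint>

// q0..q3 hold qword 0..3 of 8 pre-transposed queries, one 512-bit
// register per qword lane. Requires AVX512F + AVX512VPOPCNTDQ.
static inline __mmask8 test8(
    __m512i q0, __m512i q1, __m512i q2, __m512i q3,
    const uint64_t* db,    // one 256-bit database hash, 4 qwords
    __m512i threshold) {   // max Hamming distance, broadcast to all lanes
  // Broadcast each database qword and XOR vertically against all 8 queries.
  __m512i x0 = _mm512_xor_epi64(q0, _mm512_set1_epi64((long long)db[0]));
  __m512i x1 = _mm512_xor_epi64(q1, _mm512_set1_epi64((long long)db[1]));
  __m512i x2 = _mm512_xor_epi64(q2, _mm512_set1_epi64((long long)db[2]));
  __m512i x3 = _mm512_xor_epi64(q3, _mm512_set1_epi64((long long)db[3]));
  // Vertical popcounts and adds: lane i ends up holding the full 256-bit
  // Hamming distance between query i and this hash. No cross-lane shuffles.
  __m512i d = _mm512_add_epi64(
      _mm512_add_epi64(_mm512_popcnt_epi64(x0), _mm512_popcnt_epi64(x1)),
      _mm512_add_epi64(_mm512_popcnt_epi64(x2), _mm512_popcnt_epi64(x3)));
  // One vectorized compare yields a hit mask for all 8 queries at once.
  return _mm512_cmple_epu64_mask(d, threshold);
}
```

In this sketch, each database hash costs four broadcast loads, four XOR+POPCNT pairs, three vertical adds, and one mask compare, amortized across 8 queries.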
I gave my best attempt at emulating this with pure autovectorized code (which isn't in any way more readable IMO); it still resulted in compiler-inserted permutations that are almost guaranteed to be much slower, and GCC 15 completely refused to vectorize the code for some reason. https://godbolt.org/z/9cqfjM8cE
"Emulated AVX512":
Intrinsics:
Sorry, I'm way underwater. Thanks @ianwal for the help here!
> This is a breaking API change and needs a mention in the release notes
We aren't maintaining release notes at the moment, but I can start a document. The biggest source of downstream usage that I know of is https://github.com/faustomorales/pdqhash-python (pdq bindings).
Huge, huge thanks for the high-quality PR here with unit tests. I haven't coded C++ in like a decade, but I'll do my best to look.
It seems like clang-format isn't picking up the .clang-format file, because I did run make format before committing. I looked at the CI logs and ran that container; if this doesn't fix the formatting complaint, I probably don't know how to fix it.
Thanks for the approval. Yes, I am not entirely sure either whether it is worth incorporating this into the Python threatexchange: managing the data structure for live mutations is possible, but it requires some high-level wrapping, like paging and atomic operations, that is difficult to do from Python, and platforms that need to squeeze out this kind of performance/consistency (60ms vs. 15ms) probably already have a C++/Rust codebase with their own preferred way to do it.
Do you need to approve the workflow run, @Dcallies? Let's double-check for any new compiler warnings that show up there before merging.
Yeah, it needs me to manually run the CI for non-contributors (security step).
Looks like we are aligned on landing, @eternal-flame-AD , we just need to get the CI green again and we can merge!
I fixed the CMake complaint, but it seems there is some weirdness going on with the Makefile warning flags: it first sets a WFLAGS that is different for each compiler, then resets it to another constant value?
https://github.com/facebook/ThreatExchange/blob/5eae856ab2e8ed6984469e97beec6958c463507d/pdq/cpp/Makefile#L2-L17
I tried make wasm and it doesn't seem to work for me, but that is unrelated to this PR (we probably shouldn't set -Werror as an un-overridable default?):
> make wasm
emcc -g -std=c++11 -D BUILD_WASM -s ERROR_ON_UNDEFINED_SYMBOLS=0 -s EXPORTED_RUNTIME_METHODS=['ccall','UTF8ToString '] -s EXPORTED_FUNCTIONS=['_getHash'] -s FORCE_FILESYSTEM=1 -s LLD_REPORT_UNDEFINED -s ERROR_ON_UNDEFINED_SYMBOLS=0 -s DISABLE_EXCEPTION_CATCHING=0 -s ALLOW_MEMORY_GROWTH=1 -I../.. -I. -Wall -Wno-char-subscripts -Wno-stringop-truncation -Wno-class-memaccess -Wno-misleading-indentation -Wno-unknown-warning-option -Werror -O3 bin/pdq-photo-hasher.cpp common/pdqhashtypes.cpp io/hashio.cpp common/pdqhamming.cpp hashing/pdqhashing.cpp downscaling/downscaling.cpp io/pdqio.cpp hashing/torben.cpp common/pdqutils.cpp -o pdq-photo-hasher.js -lm -lpthread
In file included from bin/pdq-photo-hasher.cpp:7:
In file included from ../../pdq/cpp/io/pdqio.h:14:
./CImg.h:81800:23: error: first argument in call to 'memcpy' is a pointer to non-trivially copyable type 'CImg<char>' [-Werror,-Wnontrivial-memcall]
81800 | std::memcpy(new_data, _data, sizeof(CImg<T>) * npos);
| ^
./CImg.h:81867:7: note: in instantiation of member function 'cimg_library::CImgList<char>::insert' requested here
81867 | insert(empty, npos + i);
| ^
./CImg.h:16345:18: note: in instantiation of member function 'cimg_library::CImgList<char>::insert' requested here
16345 | move_to(list.insert(1, npos)[npos]);
| ^
./CImg.h:87439:27: note: in instantiation of function template specialization 'cimg_library::CImg<char>::move_to<char>' requested here
87439 | full_filename.move_to(res);
| ^
./CImg.h:81800:23: note: explicitly cast the pointer to silence this warning
81800 | std::memcpy(new_data, _data, sizeof(CImg<T>) * npos);
| ^
| (void*)
./CImg.h:81803:31: error: first argument in call to 'memcpy' is a pointer to non-trivially copyable type 'CImg<char>' [-Werror,-Wnontrivial-memcall]
81803 | new_data + npos + 1,
| ^
./CImg.h:81803:31: note: explicitly cast the pointer to silence this warning
81803 | new_data + npos + 1,
| ^
| (void*)
./CImg.h:81819:21: error: first argument in call to 'memset' is a pointer to non-trivially copyable type 'CImg<char>' [-Werror,-Wnontrivial-memcall]
81819 | std::memset(_data, 0, sizeof(CImg<T>) * (_width - 1));
| ^
./CImg.h:81819:21: note: explicitly cast the pointer to silence this warning
81819 | std::memset(_data, 0, sizeof(CImg<T>) * (_width - 1));
| ^
| (void*)
<...>
@Dcallies any updates? Can you approve the CI so I know it passes?
The WASM bug seems to be unrelated to this PR.
I am not familiar with the current state of the higher-level abstractions; if we already have a C++ dependency for HMA, maybe we can introduce this in a later PR through the Python API (I haven't thought through the details, but if 100% recall is required, I believe this is currently the most competitive solution for PDQ).
Apologies, swamped on my end, I'll give notice to @faustomorales that there's an updated PDQ version.
Here to report that the Python bindings include this new version on v0.2.8 🫡 🎉 😂