rav1e
sad_32x32 and sad_64x64 AVX2 have poor cache locality
This applies at least to the HBD (high bit depth) ASM; I have not tested against LBD. Benchmarking shows a large number of cache read misses. Noting this as a possible area for performance improvement.
Could you please add how you determined that so willing people can repeat the exercise? :)
Yes, this was measured using valgrind, specifically in this case:

    valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes --simulate-cache=yes target/release/rav1e -s 2 --no-scene-detection -i 0 -I 0 ~/xiph-media-files/objective-1-fast-10bit/speed_bag_640x360_60f.y4m -o /dev/null --limit 20

With cache simulation enabled, valgrind reports cache misses as one of its metrics, and these can be viewed in kcachegrind. (The downside is that valgrind is quite a bit slower than perf.)
This might not really be the SAD itself, but rather the nature of, e.g., motion compensation. Is this specific to AVX2?
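For anyone trying to tell those two explanations apart, a scalar reference of what a SAD kernel reads may help. This is only a minimal sketch, not rav1e's actual implementation: the function name `sad_hbd` and its signature are made up for illustration, and strides are assumed to be in samples rather than bytes. Each row touches `w` consecutive 16-bit samples (64 bytes for a 32-wide block, 128 bytes for a 64-wide one), and successive rows are a full stride apart, so the memory access pattern does not depend on the instruction set used.

```rust
/// Hypothetical scalar reference SAD over a `w` x `h` block of
/// 16-bit (HBD) samples. Strides are counted in samples, not bytes.
fn sad_hbd(
    src: &[u16],
    src_stride: usize,
    dst: &[u16],
    dst_stride: usize,
    w: usize,
    h: usize,
) -> u32 {
    let mut sum = 0u32;
    for y in 0..h {
        // Each row is a contiguous run of `w` samples; the next row
        // starts one full stride later in memory.
        let s = &src[y * src_stride..y * src_stride + w];
        let d = &dst[y * dst_stride..y * dst_stride + w];
        for (&a, &b) in s.iter().zip(d.iter()) {
            // Absolute difference, widened so the subtraction cannot wrap.
            sum += (i32::from(a) - i32::from(b)).unsigned_abs();
        }
    }
    sum
}
```

If a scalar version like this shows a similar miss profile under the same callgrind run, that would suggest the misses come from where the blocks live in memory rather than from the AVX2 code itself.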