Add optimized SSE2 routines for bottleneck functions

Open aklomp opened this issue 11 years ago • 2 comments
This pull request provides SSE2 vectorized implementations of alg_update_reference_frame() and alg_noise_tune(). Profiling with Callgrind on my Atom server showed that the former was the single most expensive function. Rewriting it in branchless SSE2 code cuts Motion's load average roughly in half for me. Per-function benchmarking shows a speedup of around 2× for alg_update_reference_frame() and around 4× for alg_noise_tune(). Results differ across hardware and compilers, but every configuration I tested showed a significant speedup.
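To illustrate what "branchless" means here, the following is a minimal sketch of the general technique and not the actual alg_update_reference_frame() code. The function names, the threshold parameter, and the update rule (take the new pixel only when it differs from the reference by more than a threshold) are all hypothetical. The per-pixel `if` becomes a byte mask, and the result is selected with AND/ANDNOT/OR across 16 pixels at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical rule: replace ref[i] by img[i] when they differ by more
 * than `thresh`. This is the branching scalar version. */
static void update_plain(uint8_t *ref, const uint8_t *img, size_t n, uint8_t thresh)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t diff = ref[i] > img[i] ? ref[i] - img[i] : img[i] - ref[i];
        if (diff > thresh)
            ref[i] = img[i];
    }
}

/* Same rule without a per-pixel branch, 16 pixels per iteration. */
static void update_simd(uint8_t *ref, const uint8_t *img, size_t n, uint8_t thresh)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i vthresh = _mm_set1_epi8((char)thresh);
    const __m128i zero = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {
        __m128i r = _mm_loadu_si128((const __m128i *)(ref + i));
        __m128i m = _mm_loadu_si128((const __m128i *)(img + i));
        /* |r - m| on unsigned bytes: one saturating subtraction is zero. */
        __m128i diff = _mm_or_si128(_mm_subs_epu8(r, m), _mm_subs_epu8(m, r));
        /* keep = 0xFF where diff <= thresh (SSE2 has no unsigned compare). */
        __m128i keep = _mm_cmpeq_epi8(_mm_subs_epu8(diff, vthresh), zero);
        /* Select: (keep & r) | (~keep & m). */
        __m128i out = _mm_or_si128(_mm_and_si128(keep, r),
                                   _mm_andnot_si128(keep, m));
        _mm_storeu_si128((__m128i *)(ref + i), out);
    }
#endif
    /* Scalar tail, or the whole buffer when SSE2 is unavailable. */
    update_plain(ref + i, img + i, n - i, thresh);
}

/* Returns 1 when both versions produce identical output. */
static int update_selfcheck(void)
{
    uint8_t a[37], b[37], img[37];
    for (size_t i = 0; i < sizeof(a); i++) {
        a[i] = b[i] = (uint8_t)(i * 7);
        img[i] = (uint8_t)(255 - i * 5);
    }
    update_plain(a, img, sizeof(a), 40);
    update_simd(b, img, sizeof(b), 40);
    int same = memcmp(a, b, sizeof(a)) == 0;
    assert(same);
    return same;
}
```

The buffer length of 37 in the self-check is deliberately not a multiple of 16, so both the vector loop and the scalar tail are exercised.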

The plain functions have been lifted out of alg.c and placed in the new alg/ subdirectory, along with their SSE2 versions. In alg.c, the preprocessor chooses between including plain or SSE2 functions at compile time. This is perhaps not in line with the rest of the codebase, but made it possible to build a test harness around the functions that checks their correctness and does performance benchmarks. This harness can be found in alg/tests.
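The compile-time selection and the idea behind the test harness can be sketched roughly as follows. This is an assumption-laden illustration, not the code in alg/ or alg/tests: `count_above` and its variants are hypothetical functions, and in the real layout each variant would live in its own file pulled in with `#include` rather than sitting in one translation unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "Plain" variant: count pixels strictly above `level`. Hypothetical. */
static size_t count_above_plain(const uint8_t *p, size_t n, uint8_t level)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (p[i] > level);
    return count;
}

#if defined(__SSE2__)
#include <emmintrin.h>

/* SSE2 variant of the same hypothetical function. */
static size_t count_above_sse2(const uint8_t *p, size_t n, uint8_t level)
{
    const __m128i vlevel = _mm_set1_epi8((char)level);
    const __m128i zero = _mm_setzero_si128();
    const __m128i one = _mm_set1_epi8(1);
    size_t count = 0, i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        /* le = 0xFF where v <= level (saturating v - level == 0). */
        __m128i le = _mm_cmpeq_epi8(_mm_subs_epu8(v, vlevel), zero);
        /* gt = 1 in every byte where v > level, else 0. */
        __m128i gt = _mm_andnot_si128(le, one);
        /* Horizontal add: two partial sums, one per 64-bit half. */
        __m128i sums = _mm_sad_epu8(gt, zero);
        count += (size_t)_mm_cvtsi128_si32(sums)
               + (size_t)_mm_extract_epi16(sums, 4);
    }
    return count + count_above_plain(p + i, n - i, level);
}
#endif

/* Compile-time dispatch: the preprocessor picks one implementation. */
#if defined(__SSE2__)
#define count_above count_above_sse2
#else
#define count_above count_above_plain
#endif

/* Harness idea: the selected variant must agree with the plain one
 * for every length, including lengths that leave a scalar tail. */
static int count_above_check(void)
{
    uint8_t buf[100];
    for (size_t i = 0; i < sizeof(buf); i++)
        buf[i] = (uint8_t)(i * 37);
    for (size_t n = 0; n <= sizeof(buf); n++) {
        int same = count_above(buf, n, 128) == count_above_plain(buf, n, 128);
        assert(same);
        if (!same)
            return 0;
    }
    return 1;
}
```

Keeping the plain version callable under its own name is what makes this kind of check possible: it acts as the reference for the vectorized version, and the same pair of functions can be timed against each other for benchmarking.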

My website has a writeup of how I converted alg_update_reference_frame() to branchless code. It demonstrates how to derive the logic step by step, and is a kind of giant comment on the code. Hopefully it will help in reviewing the code for correctness.

This code took quite some time to write, but probably deserves more real-world testing than it's had so far. All comments welcome!

aklomp, Sep 10 '14 20:09

That's nice! Would the ARM platform (SIMD, NEON) benefit from this?

tosiara, Sep 11 '14 04:09

No, ARM won't benefit from this patch directly. The functions I added use x86 SSE2 intrinsics, and those are not portable. Platforms that don't support SSE2 will keep using the current "plain" implementation. Here's the compile-time dispatcher for one of the functions.

Of course the vectorized algorithm could be ported to NEON intrinsics, and someone could add an alg/alg_noise_tune.neon.c to the dispatcher.

aklomp, Sep 11 '14 07:09