Optimize scalar ycbcr conversion

Open vstroebel opened this issue 3 months ago • 2 comments

This gives the compiler more information to vectorize the code. On Zen 3 with target-cpu=native, this is nearly 40% faster in the ycbcr Criterion micro-benchmark than the current AVX2 code path.
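
For readers who have not seen the diff, here is a minimal sketch of the kind of scalar loop that tends to autovectorize well. It is not the code from this change. The function names, the planar output layout, and the RGB to YCbCr direction are all assumptions made for illustration; the coefficients are the usual JFIF ones in 16-bit fixed point.

```rust
/// Hypothetical per-pixel conversion (not the code from this PR).
/// Coefficients are round(c * 65536) for the JFIF RGB -> YCbCr matrix.
#[inline(always)]
fn rgb_to_ycbcr_pixel(r: u8, g: u8, b: u8) -> (u8, u8, u8) {
    let (r, g, b) = (r as i32, g as i32, b as i32);
    const HALF: i32 = 1 << 15; // rounding term for luma
    const OFFSET: i32 = 128 << 16; // chroma is centered on 128
    let y = (19595 * r + 38470 * g + 7471 * b + HALF) >> 16;
    // HALF - 1 keeps the chroma result within 0..=255, so no clamp is needed.
    let cb = (-11059 * r - 21709 * g + 32768 * b + OFFSET + HALF - 1) >> 16;
    let cr = (32768 * r - 27439 * g - 5329 * b + OFFSET + HALF - 1) >> 16;
    (y as u8, cb as u8, cr as u8)
}

/// Hypothetical planar conversion: interleaved RGB in, three planes out.
/// Panics if any output plane is shorter than `rgb.len() / 3`.
pub fn rgb_to_ycbcr_planes(rgb: &[u8], y: &mut [u8], cb: &mut [u8], cr: &mut [u8]) {
    let n = rgb.len() / 3;
    // Re-slice everything to the same pixel count up front, so the loop body
    // contains no bounds checks that could block vectorization.
    let rgb = &rgb[..n * 3];
    let (y, cb, cr) = (&mut y[..n], &mut cb[..n], &mut cr[..n]);
    let planes = y.iter_mut().zip(cb.iter_mut()).zip(cr.iter_mut());
    for (px, ((y, cb), cr)) in rgb.chunks_exact(3).zip(planes) {
        let (yy, bb, rr) = rgb_to_ycbcr_pixel(px[0], px[1], px[2]);
        *y = yy;
        *cb = bb;
        *cr = rr;
    }
}
```

The idea is plain integer arithmetic, no branches in the loop body, and slice lengths the compiler can reason about. Whether a given compiler version actually vectorizes this depends on the target features enabled at build time.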

Nov 25 '25 18:11 vstroebel

This looks very promising! If we could also generate versions with #[target_feature] attributes for SSE 4.2 and AVX2 and dispatch to those, that'd be great!

The multiversion crate makes that easy, but I don't know if it will work inside a macro. If it doesn't, we can change this function from being generated by a macro to being a const generic function.
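
To make the fallback concrete, here is a rough sketch of the const generic approach with hand-written dispatch, using only the standard library instead of multiversion. All names are hypothetical, `STRIDE` (bytes per pixel, at least 3) stands in for whatever the macro currently varies, and only the Y channel is computed to keep it short.

```rust
/// Shared scalar body. `STRIDE` is bytes per pixel (3 for RGB, 4 for RGBA).
/// Hypothetical and simplified: computes luma only.
#[inline(always)]
fn luma_impl<const STRIDE: usize>(pixels: &[u8], out: &mut [u8]) {
    for (px, o) in pixels.chunks_exact(STRIDE).zip(out.iter_mut()) {
        let (r, g, b) = (px[0] as u32, px[1] as u32, px[2] as u32);
        *o = ((19595 * r + 38470 * g + 7471 * b + (1 << 15)) >> 16) as u8;
    }
}

// Each wrapper inlines the shared body, so the compiler can autovectorize
// it with the wider instruction set enabled for that wrapper.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn luma_avx2<const STRIDE: usize>(pixels: &[u8], out: &mut [u8]) {
    luma_impl::<STRIDE>(pixels, out)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn luma_sse42<const STRIDE: usize>(pixels: &[u8], out: &mut [u8]) {
    luma_impl::<STRIDE>(pixels, out)
}

/// Safe entry point: picks the best available version at runtime.
pub fn luma<const STRIDE: usize>(pixels: &[u8], out: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        // SAFETY: each call is guarded by a runtime check for its feature.
        if is_x86_feature_detected!("avx2") {
            return unsafe { luma_avx2::<STRIDE>(pixels, out) };
        }
        if is_x86_feature_detected!("sse4.2") {
            return unsafe { luma_sse42::<STRIDE>(pixels, out) };
        }
    }
    luma_impl::<STRIDE>(pixels, out)
}
```

A caller would write `luma::<3>(rgb, &mut y)` or `luma::<4>(rgba, &mut y)`. This is roughly the boilerplate that multiversion is meant to generate, so it is only worth writing by hand if the attribute turns out not to work inside the macro.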

Nov 29 '25 16:11 Shnatsel

I can confirm the performance gain compared to explicit AVX on desktop Zen 4 as well, built with -C target-cpu=x86-64-v3 rather than native, which would tune for my CPU specifically.

Nov 29 '25 18:11 Shnatsel