Optimize scalar ycbcr conversion
This gives the compiler more information to vectorize the code. On Zen 3 with target-cpu=native, this is nearly 40% faster in the ycbcr criterion micro-benchmark than the current AVX2 code path.
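For readers unfamiliar with the idea, here is a minimal sketch of a scalar loop shaped so that the compiler has a good chance of auto-vectorizing it. It is not the code from this change: the function name `ycbcr_to_rgb_row`, the 16-bit fixed-point constants, and the interleaved RGB output layout are all assumptions made for illustration. The points it tries to show are length checks hoisted out of the loop, iteration via `chunks_exact_mut` and `zip` so no per-element bounds checks remain, and branch-free integer arithmetic.

```rust
// Hypothetical illustration, not the crate's actual implementation.
// Fixed-point scale: coefficients are the JFIF constants multiplied by 2^16.
const SHIFT: i32 = 16;
const HALF: i32 = 1 << (SHIFT - 1); // rounding term added before the shift

/// Convert one row of planar Y, Cb, Cr samples to interleaved RGB.
pub fn ycbcr_to_rgb_row(y: &[u8], cb: &[u8], cr: &[u8], out: &mut [u8]) {
    // Checking lengths once up front tells the compiler all slices agree,
    // so the loop body needs no further bounds checks.
    let n = y.len();
    assert!(cb.len() == n && cr.len() == n && out.len() == n * 3);

    for (((px, &luma), &cb), &cr) in out.chunks_exact_mut(3).zip(y).zip(cb).zip(cr) {
        let luma = (i32::from(luma) << SHIFT) + HALF;
        let cb = i32::from(cb) - 128;
        let cr = i32::from(cr) - 128;

        // R = Y + 1.402 Cr, G = Y - 0.344136 Cb - 0.714136 Cr, B = Y + 1.772 Cb
        let r = (luma + 91881 * cr) >> SHIFT;
        let g = (luma - 22553 * cb - 46802 * cr) >> SHIFT;
        let b = (luma + 116130 * cb) >> SHIFT;

        // clamp compiles to min/max, which vectorizes without branches.
        px[0] = r.clamp(0, 255) as u8;
        px[1] = g.clamp(0, 255) as u8;
        px[2] = b.clamp(0, 255) as u8;
    }
}
```

Whether a loop like this actually vectorizes depends on the compiler version and target features, so the generated assembly or a benchmark is the only reliable check.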
This looks very promising! If we could also generate versions with #[target_feature] attributes for SSE 4.2 and AVX2 and dispatch to those, that'd be great!
The multiversion crate makes that easy, but I don't know whether it works inside a macro. If it doesn't, we could change this function from being generated by a macro to being a const generic function.
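To make the fallback concrete, here is a rough sketch of what a const generic function with hand-written dispatch could look like using only the standard library, without multiversion. Everything named here is hypothetical: `convert_row`, `convert_row_dispatch`, and the `STRIDE` parameter (3 for RGB, 4 for RGBA) are stand-ins, since I have not checked what the macro is actually parameterized over. The idea is that the `#[inline(always)]` scalar body gets inlined into each `#[target_feature]` wrapper and recompiled with those features enabled.

```rust
/// Scalar kernel. `STRIDE` (hypothetical) is bytes per output pixel: 3 = RGB, 4 = RGBA.
#[inline(always)]
fn convert_row<const STRIDE: usize>(y: &[u8], cb: &[u8], cr: &[u8], out: &mut [u8]) {
    assert!(STRIDE == 3 || STRIDE == 4);
    assert!(cb.len() == y.len() && cr.len() == y.len() && out.len() == y.len() * STRIDE);
    for (((px, &luma), &cb), &cr) in out.chunks_exact_mut(STRIDE).zip(y).zip(cb).zip(cr) {
        // 16-bit fixed-point JFIF coefficients, assumed for illustration.
        let luma = (i32::from(luma) << 16) + (1 << 15);
        let (cb, cr) = (i32::from(cb) - 128, i32::from(cr) - 128);
        px[0] = ((luma + 91881 * cr) >> 16).clamp(0, 255) as u8;
        px[1] = ((luma - 22553 * cb - 46802 * cr) >> 16).clamp(0, 255) as u8;
        px[2] = ((luma + 116130 * cb) >> 16).clamp(0, 255) as u8;
        if STRIDE == 4 {
            px[3] = 255; // opaque alpha
        }
    }
}

// Same body, compiled with AVX2 enabled.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn convert_row_avx2<const STRIDE: usize>(y: &[u8], cb: &[u8], cr: &[u8], out: &mut [u8]) {
    convert_row::<STRIDE>(y, cb, cr, out)
}

// Same body, compiled with SSE 4.2 enabled.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sse4.2")]
unsafe fn convert_row_sse42<const STRIDE: usize>(y: &[u8], cb: &[u8], cr: &[u8], out: &mut [u8]) {
    convert_row::<STRIDE>(y, cb, cr, out)
}

/// Picks the best available version at runtime, falling back to the baseline build.
pub fn convert_row_dispatch<const STRIDE: usize>(y: &[u8], cb: &[u8], cr: &[u8], out: &mut [u8]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified at runtime.
            return unsafe { convert_row_avx2::<STRIDE>(y, cb, cr, out) };
        }
        if is_x86_feature_detected!("sse4.2") {
            // SAFETY: SSE 4.2 support was just verified at runtime.
            return unsafe { convert_row_sse42::<STRIDE>(y, cb, cr, out) };
        }
    }
    convert_row::<STRIDE>(y, cb, cr, out)
}
```

Calling it would look like `convert_row_dispatch::<4>(&y, &cb, &cr, &mut rgba)`. The feature check runs on every call in this sketch, so it would be best done once per row or per image rather than per pixel.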
I can confirm the performance gain compared to explicit AVX on desktop Zen 4 as well, built with -C target-cpu=x86-64-v3 rather than native, which would tune for my CPU specifically.