jpeg-decoder
jpeg-decoder copied to clipboard
Change color_convert_line_cmyk so it gets auto vectorized
Finally came back to this PR, looking back into it changing the code so the compiler can auto vectorize seemed like the best solution
Could you elaborate a bit more on this change? I'd be hesitant to add platform specific unsafe SIMD intrinsics unless there was a substantial performance improvement, and no alternative way to get the same speedup.
Maybe this function could be vectorized in a platform-independent manner by applying a bitwise NOT on u32 values instead ?
Given how simple the loop is, it is definitely possible to get the compiler to autovectorize it. However there is a simpler question: is this function actually a performance bottleneck at all? I'd strongly suspect that the JPEG decoding is much more costly than a single pass over the image data
@fintelia : This crate has no benchmark, making it difficult to take optimization decisions. What would you think about a PR adding one or two benchmarks, and adding cargo bench to the CI ?
@lovasoa That would be awesome to have
I added a benchmark in #122
The function in question is now also doing shuffling (i.e. it gathers results from 4 vectors, each with its own component, and interleaves them in the output).
https://github.com/image-rs/jpeg-decoder/blob/09a2c2d81c89014f1477a2507b20f5fa9e3d8e01/src/decoder.rs#L1280-L1296
The shuffling is necessary because parallelization is per-component.
So this optimization is no longer applicable.
It might be worth considering splitting the shuffling and the actual color conversion, so that the color conversion coukd still be vectorized.
I've scanned ~10,000 JPEGs and failed to find any file that would trigger this codepath, so I'm just going to assume this is a rarely taken codepath that's not really worth micro-optimizing.