loop filter is slow
On this lossy image, image-webp is 2.8x slower than dwebp and 1.4x slower than dwebp -noasm
The profile points to the loop filter being the worst offender, accounting for half the decoding time: https://share.firefox.dev/45Jjlhp
image_webp::loop_filter::should_filter alone is taking up 25% of the total execution time, and it looks fairly simple to optimize.
I wonder what strides are valid - perhaps we could use const generics to specialize the implementation for each stride, like in #134?
Half the invocations have a hardcoded stride = 1 (presumably for filtering horizontal edges) so a specialized implementation for that case should be beneficial. It is easy to vectorize since all the values are adjacent.
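To make the const-generics idea concrete, here is a minimal sketch of a stride-specialized check. It is not the actual image-webp code: the name, signature, and pixel layout (`point` indexing q0, with p0..p3 before it and q1..q3 after it, each `STRIDE` bytes apart) are assumptions for illustration, and the threshold test follows my reading of the normal loop filter in the VP8 spec.

```rust
/// Hypothetical stride-specialized filter check (illustration only).
/// `point` is the index of q0; p0..p3 sit at point - STRIDE, point - 2 * STRIDE, ...
/// and q1..q3 sit at point + STRIDE, point + 2 * STRIDE, ...
pub fn should_filter<const STRIDE: usize>(
    interior_limit: u8,
    edge_limit: u8,
    pixels: &[u8],
    point: usize,
) -> bool {
    // Slice the whole window once: a single bounds check up front, and with a
    // compile-time STRIDE the offsets below are all constants.
    let window = &pixels[point - 4 * STRIDE..=point + 3 * STRIDE];
    let p = |i: usize| window[(3 - i) * STRIDE]; // p3 is at offset 0
    let q = |i: usize| window[(4 + i) * STRIDE]; // q0 is at offset 4 * STRIDE

    // Widen to u16 so the doubled difference cannot overflow.
    let edge = u16::from(p(0).abs_diff(q(0))) * 2 + u16::from(p(1).abs_diff(q(1))) / 2;

    edge <= u16::from(edge_limit)
        && p(3).abs_diff(p(2)) <= interior_limit
        && p(2).abs_diff(p(1)) <= interior_limit
        && p(1).abs_diff(p(0)) <= interior_limit
        && q(1).abs_diff(q(0)) <= interior_limit
        && q(2).abs_diff(q(1)) <= interior_limit
        && q(3).abs_diff(q(2)) <= interior_limit
}
```

With `should_filter::<1>` the eight samples are contiguous, which is the case that should be easiest to vectorize. Callers with a runtime stride would still need a `match` that dispatches to the handful of instantiations that are actually valid.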
I was looking at this again after the recent activity. It's hard to see places to parallelize or reuse work within loop_filter.rs.
The one place I could see is loop_filter::should_filter_horizontal, where I don't think the abs_diff comparisons will be auto-vectorized without explicit SIMD code.
After writing that by hand, the largest effect is on the subblock filter (roughly 1.6x faster in the microbenchmarks below, versus roughly 1.1x for the macroblock filter), but I'm not sure what the end-to-end timings are. The intrinsics themselves can be used from safe code, but calling the target_feature function still requires unsafe.
# SIMD
test loop_filter::benches::measure_horizontal_macroblock_filter ... bench: 59.75 ns/iter (+/- 1.97)
test loop_filter::benches::measure_horizontal_subblock_filter ... bench: 38.85 ns/iter (+/- 2.11)
# Scalar
test loop_filter::benches::measure_horizontal_macroblock_filter ... bench: 66.86 ns/iter (+/- 2.26)
test loop_filter::benches::measure_horizontal_subblock_filter ... bench: 62.47 ns/iter (+/- 2.31)
Godbolt: https://rust.godbolt.org/z/cdj3zzffr (updated link, I forgot the assert in the first version)
Patch: simd_horizontal.patch
This didn't have a positive effect when I applied it to the vertical filter. That would probably require re-architecting higher up in the stack in vp8.rs.
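For readers who don't want to open the patch, the kind of explicit abs_diff comparison discussed above can be sketched as follows. This is not the code from simd_horizontal.patch: the function names and the 16-lane bitmask return value are assumptions for illustration. It uses the usual SSE2 trick of building an unsigned absolute difference from two saturating subtractions, with a scalar fallback on other architectures.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: bit i is set when |a[i] - b[i]| <= limit.
fn within_limit_scalar(a: &[u8; 16], b: &[u8; 16], limit: u8) -> u16 {
    let mut mask = 0u16;
    for i in 0..16 {
        if a[i].abs_diff(b[i]) <= limit {
            mask |= 1 << i;
        }
    }
    mask
}

/// Hypothetical SSE2 version of the same check, 16 lanes at once.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn within_limit_sse2(a: &[u8; 16], b: &[u8; 16], limit: u8) -> u16 {
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        // One of the two saturating subtractions is zero in every lane,
        // so OR-ing them yields the unsigned absolute difference.
        let diff = _mm_or_si128(_mm_subs_epu8(va, vb), _mm_subs_epu8(vb, va));
        // diff <= limit  <=>  saturating (diff - limit) == 0
        let over = _mm_subs_epu8(diff, _mm_set1_epi8(limit as i8));
        let ok = _mm_cmpeq_epi8(over, _mm_setzero_si128());
        // Collect the top bit of each lane into the low 16 bits.
        _mm_movemask_epi8(ok) as u16
    }
}

/// Safe entry point: this is where the one unavoidable `unsafe` call lives.
pub fn within_limit(a: &[u8; 16], b: &[u8; 16], limit: u8) -> u16 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: SSE2 support was checked at runtime just above.
            return unsafe { within_limit_sse2(a, b, limit) };
        }
    }
    within_limit_scalar(a, b, limit)
}
```

The mask form is only one option. A real filter would more likely keep the comparison result in a vector register and use it to blend filtered and unfiltered pixels, which is part of why a column-wise (vertical) variant needs the data laid out differently further up the stack.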