image-webp

loop filter is slow

Open Shnatsel opened this issue 8 months ago • 2 comments

On this lossy image, image-webp is 2.8x slower than dwebp and 1.4x slower than dwebp -noasm.

honk.webp.zip

The profile points to the loop filter being the worst offender, accounting for half the decoding time: https://share.firefox.dev/45Jjlhp

image_webp::loop_filter::should_filter alone is taking up 25% of the total execution time, and it looks fairly simple to optimize.

I wonder what strides are valid - perhaps we could use const generics to specialize the implementation for each stride, like in #134?
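As a rough illustration of the const-generics idea, here is a minimal sketch of a threshold check specialized per stride. It assumes the check follows the usual VP8 normal-filter test (edge limit on p0/q0 and p1/q1, interior limit on neighbouring pixels, as described in RFC 6386). The name `should_filter_strided` and its signature are hypothetical and not taken from the crate.

```rust
/// Hypothetical stride-specialized threshold check (not the crate's code).
/// `point` indexes q0, the first pixel after the edge; p0 sits one stride
/// before it. The caller must guarantee `point >= 4 * STRIDE` and
/// `point + 3 * STRIDE < pixels.len()`.
fn should_filter_strided<const STRIDE: usize>(
    interior_limit: u8,
    edge_limit: u8,
    pixels: &[u8],
    point: usize,
) -> bool {
    // p(0)..p(3) walk away from the edge on one side, q(0)..q(3) on the other.
    // STRIDE is a compile-time constant, so every offset below is too.
    let p = |i: usize| pixels[point - (i + 1) * STRIDE];
    let q = |i: usize| pixels[point + i * STRIDE];

    // Edge test: |p0 - q0| * 2 + |p1 - q1| / 2 must not exceed the edge limit.
    let edge = u16::from(p(0).abs_diff(q(0))) * 2 + u16::from(p(1).abs_diff(q(1))) / 2;

    edge <= u16::from(edge_limit)
        && p(3).abs_diff(p(2)) <= interior_limit
        && p(2).abs_diff(p(1)) <= interior_limit
        && p(1).abs_diff(p(0)) <= interior_limit
        && q(0).abs_diff(q(1)) <= interior_limit
        && q(1).abs_diff(q(2)) <= interior_limit
        && q(2).abs_diff(q(3)) <= interior_limit
}
```

A caller would pick the instantiation at the call site, for example `should_filter_strided::<1>(interior, edge, pixels, point)` for adjacent pixels. Whether this helps in practice depends on which strides actually occur, which is the open question above.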

Shnatsel avatar Jun 04 '25 19:06 Shnatsel

Half the invocations have a hardcoded stride = 1 (presumably for filtering horizontal edges) so a specialized implementation for that case should be beneficial. It is easy to vectorize since all the values are adjacent.
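To make the stride = 1 case concrete, here is a sketch that treats the eight pixels across the edge as one contiguous window, so the compiler sees a fixed-size array instead of eight separately bounds-checked loads. The names `should_filter_adjacent` and `window_at` are hypothetical, and the threshold formula is assumed to be the usual VP8 one rather than copied from the crate.

```rust
use std::convert::TryFrom;

/// Hypothetical stride-1 variant: the eight pixels across the edge form one
/// contiguous window `[p3, p2, p1, p0, q0, q1, q2, q3]`.
fn should_filter_adjacent(interior_limit: u8, edge_limit: u8, window: &[u8; 8]) -> bool {
    let [_, _, p1, p0, q0, q1, _, _] = *window;

    // Edge test: |p0 - q0| * 2 + |p1 - q1| / 2 must not exceed the edge limit.
    let edge = u16::from(p0.abs_diff(q0)) * 2 + u16::from(p1.abs_diff(q1)) / 2;
    if edge > u16::from(edge_limit) {
        return false;
    }

    // Interior test: the largest difference between neighbouring pixels on
    // each side of the edge. Index 3 would be |p0 - q0|, which is skipped.
    let mut max_step = 0u8;
    for i in [0, 1, 2, 4, 5, 6] {
        max_step = max_step.max(window[i].abs_diff(window[i + 1]));
    }
    max_step <= interior_limit
}

/// Borrows the window around `point` (the index of q0), or returns `None`
/// if it would fall outside `pixels`. This is the only bounds check needed.
fn window_at(pixels: &[u8], point: usize) -> Option<&[u8; 8]> {
    let start = point.checked_sub(4)?;
    let end = start.checked_add(8)?;
    <&[u8; 8]>::try_from(pixels.get(start..end)?).ok()
}
```

Whether the optimizer actually vectorizes the loop is not guaranteed; the sketch only removes the obstacles (strided indexing and repeated bounds checks) that make it unlikely.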

Shnatsel avatar Jun 04 '25 19:06 Shnatsel

I was looking at this again after the recent activity. It's hard to see places to parallelize or reuse work within loop_filter.rs.

The one place I could see is loop_filter::should_filter_horizontal, where I don't think the abs_diff comparisons can be vectorized without explicit SIMD code.

After doing so, the largest effect is on the subblock filter, but I'm not sure what the end-to-end timings are. This can be done with safe intrinsics, but the call into the target-feature function requires unsafe.
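For readers unfamiliar with the pattern, here is a minimal sketch of an explicit SSE2 abs_diff comparison behind a safe wrapper. It is not the contents of the attached patch: the function names are hypothetical, it covers only the interior-limit part of the check, and it assumes an eight-pixel window laid out as `[p3, p2, p1, p0, q0, q1, q2, q3]`. The SIMD function is declared `unsafe fn` so that it also builds on older toolchains; on recent Rust releases these particular intrinsics can be called without `unsafe` inside a `#[target_feature]` function, leaving only the call into it unsafe.

```rust
/// Scalar reference: every neighbouring pair on each side of the edge must
/// differ by at most `interior_limit`. The pair (3, 4) is |p0 - q0|, which
/// belongs to the edge test and is skipped here.
fn interior_ok_scalar(interior_limit: u8, w: [u8; 8]) -> bool {
    [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]
        .iter()
        .all(|&(i, j)| w[i].abs_diff(w[j]) <= interior_limit)
}

/// Hypothetical SSE2 version of the same test.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn interior_ok_sse2(interior_limit: u8, window: [u8; 8]) -> bool {
    use std::arch::x86_64::*;

    // Lane i of `a` holds window[i]; lane i of `b` holds window[i + 1].
    let a = _mm_cvtsi64_si128(i64::from_le_bytes(window));
    let b = _mm_srli_si128::<1>(a);

    // Unsigned |a - b| built from two saturating subtractions.
    let diff = _mm_or_si128(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a));

    // A lane is nonzero exactly when its difference exceeds the limit.
    let over = _mm_subs_epu8(diff, _mm_set1_epi8(interior_limit as i8));

    // Ignore lane 3 (|p0 - q0|) and lane 7 (q3 against zero padding).
    let mask = _mm_cvtsi64_si128(0x00FF_FFFF_00FF_FFFF_u64 as i64);
    _mm_cvtsi128_si64(_mm_and_si128(over, mask)) == 0
}

/// Safe entry point: the only `unsafe` is the target-feature call itself.
fn interior_ok(interior_limit: u8, window: [u8; 8]) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: SSE2 support was verified on the line above.
            return unsafe { interior_ok_sse2(interior_limit, window) };
        }
    }
    interior_ok_scalar(interior_limit, window)
}
```

On other architectures the wrapper simply falls back to the scalar path, so callers never need to know which version ran.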

# SIMD
test loop_filter::benches::measure_horizontal_macroblock_filter     ... bench:          59.75 ns/iter (+/- 1.97)
test loop_filter::benches::measure_horizontal_subblock_filter       ... bench:          38.85 ns/iter (+/- 2.11)
# Scalar
test loop_filter::benches::measure_horizontal_macroblock_filter     ... bench:          66.86 ns/iter (+/- 2.26)
test loop_filter::benches::measure_horizontal_subblock_filter       ... bench:          62.47 ns/iter (+/- 2.31)

https://rust.godbolt.org/z/cdj3zzffr (updated link; I forgot the assert in the first version)

simd_horizontal.patch

This didn't have a positive effect when I applied it to the vertical filter. That would probably require re-architecting higher up in the stack in vp8.rs.

okaneco avatar Aug 25 '25 04:08 okaneco