
Decoding animated WebP is 4x slower than `libwebp-sys` + `webp-animation`

Open Shnatsel opened this issue 1 year ago • 16 comments

In image v0.25.4 and image-webp v0.2.0, decoding the attached animated WebP is 4x slower than using libwebp-sys + webp-animation: sample.zip

image + image-webp

use std::error::Error;
use image::{codecs::webp::WebPDecoder, AnimationDecoder, ImageReader};

fn main() -> Result<(), Box<dyn Error>> {
    let input = std::env::args().nth(1).unwrap();
    // Hand the underlying reader straight to the WebP decoder.
    let reader = ImageReader::open(input)?.into_inner();
    let decoder = WebPDecoder::new(reader)?;
    // Decode every frame, discarding the pixels.
    let mut iter = decoder.into_frames();
    while let Some(_frame) = iter.next() {}
    Ok(())
}

hyperfine results:

  Time (mean ± σ):     411.9 ms ±   3.9 ms    [User: 372.2 ms, System: 38.8 ms]
  Range (min … max):   407.6 ms … 416.8 ms    10 runs

libwebp-sys + webp-animation

use std::error::Error;
use webp_animation::prelude::*;

fn main() -> Result<(), Box<dyn Error>> {
    let input = std::env::args().nth(1).unwrap();

    // Decode every frame via libwebp, discarding the pixels.
    let buffer = std::fs::read(input).unwrap();
    let decoder = Decoder::new(&buffer).unwrap();
    let mut iter = decoder.into_iter();
    while let Some(_frame) = iter.next() {}
    Ok(())
}

hyperfine results:

  Time (mean ± σ):      95.9 ms ±   0.4 ms    [User: 128.7 ms, System: 7.4 ms]
  Range (min … max):    95.3 ms …  96.7 ms    30 runs

Analysis

webp-animation shows a bit of multi-threading on the profile (its user time is longer than the total wall-clock time), but even accounting for that, image-webp is roughly 3x slower.

Breakdown of where the time is spent in image, recorded by samply: https://share.firefox.dev/4fc3utg

The greatest contributors seem to be image_webp::vp8::Vp8Decoder::decode_frame (48%), image_webp::extended::do_alpha_blending (20%), image_webp::vp8::Frame::fill_rgba (16%).

Within decode_frame the biggest contributor is image_webp::vp8::Vp8Decoder::read_coefficients (12% self time, 32% total time), and the code of that function looks like it could be optimized further to reduce bounds checks, etc. #71 is also relevant, but only accounts for 20% of the total time.

Shnatsel · Oct 22 '24

I looked at the profile and the associated code a bit. The two low-hanging optimization opportunities are:

  1. Applying the YUV->RGB optimization from #13 to the RGBA codepath as well
  2. The alpha blending currently uses an f64 to blend every u8. No wonder that's slow: it kills vectorization. We should make it operate on integers (u16 might be enough?) and use chunks_exact() to get the compiler to vectorize it; a sketch of what that could look like follows this list.
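
A minimal sketch of what point 2 could look like, assuming a straight-alpha RGBA frame laid over an opaque RGB canvas (the real compositing in image-webp blends RGBA over RGBA and is more involved); div_by_255 and blend_over_opaque are illustrative names, not functions from the crate:

/// Rounded division by 255 without an integer divide:
/// (v + 128 + ((v + 128) >> 8)) >> 8 == round(v / 255) for v <= 255 * 255.
fn div_by_255(v: u16) -> u16 {
    let v = v + 128;
    (v + (v >> 8)) >> 8
}

/// Blend an RGBA frame over an opaque RGB canvas using only u16 arithmetic.
/// chunks_exact() hands the optimizer fixed-size slices, which makes the inner
/// loop much easier to auto-vectorize than per-pixel f64 math.
fn blend_over_opaque(frame: &[u8], canvas: &mut [u8]) {
    for (src, dst) in frame.chunks_exact(4).zip(canvas.chunks_exact_mut(3)) {
        let alpha = src[3] as u16;
        let inv_alpha = 255 - alpha;
        for c in 0..3 {
            // src * a + dst * (255 - a) never exceeds 255 * 255, so u16 is enough here.
            dst[c] = div_by_255(src[c] as u16 * alpha + dst[c] as u16 * inv_alpha) as u8;
        }
    }
}

Whether u16 intermediates suffice for the general RGBA-over-RGBA case is exactly the question that comes up later in this thread.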

Shnatsel · Dec 14 '24

Paper on fast alpha blending without divisions: https://arxiv.org/pdf/2202.02864

Shnatsel · Dec 15 '24

> Paper on fast alpha blending without divisions: https://arxiv.org/pdf/2202.02864

Here is a ready-made method if needed: https://github.com/awxkee/pic-scale-safe/blob/7f1066e006aac5eea4c1474f5bbe81cae837b2b7/src/alpha.rs#L43

Division by alpha, if it's required, is much less straightforward.

awxkee · Dec 15 '24

I've attempted to optimize alpha blending by performing it in u16 instead of f64. I got the primitives working (rounding integer division both by 255 and by an arbitrary u8) but when I assemble the entire thing with them it falls apart. Pretty early, too - calculating the resulting alpha is already wrong, we don't even get to the interesting part of blending the RGB values.

This is what I've got so far: https://github.com/image-rs/image-webp/compare/main...Shnatsel:image-webp:faster-alpha-blending?expand=1

By the way, @awxkee, in your div_by_255 the .min(255) part is completely unnecessary: the function never produces a value above 255 in the first place, so you don't need any clamping. My branch includes an exhaustive correctness test for that function; you can try it for yourself. I kept the .min in my branch out of an abundance of caution, but I am convinced that it does nothing.
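
For reference, the exhaustive check described above fits in a few lines. The div_by_255 below is the usual (v + 128) formulation and stands in for the one in the linked branch, which may differ in detail:

fn div_by_255(v: u16) -> u16 {
    let v = v + 128;
    (v + (v >> 8)) >> 8
}

#[test]
fn div_by_255_is_exact_on_its_whole_domain() {
    // Every product of two u8 values: the maximum is 255 * 255 = 65_025, which fits in u16.
    for a in 0u16..=255 {
        for b in 0u16..=255 {
            let v = a * b;
            let expected = ((v as f64) / 255.0).round() as u16;
            assert_eq!(div_by_255(v), expected);
            // The result never exceeds 255, so a trailing .min(255) can never fire.
            assert!(div_by_255(v) <= 255);
        }
    }
}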

Shnatsel · Dec 18 '24

min was accidentally added when I was trying to force auto-vectorization. You're correct that the division is exact, so clamping is unnecessary.

I think your rounding_div will be incredibly slow, since it performs two integer divisions at a time. Maybe it's worth generating a reduction table, since the denominator range is small ([1, 255]), and performing Montgomery-style multiplication? That's what LLVM does when dividing by a constant (https://godbolt.org/z/qnE5E8Eo3): it performs a multiplication of the form b = (a * R) >> k. Additionally, both divisions use the same denominator, so each step would require only one read from the reduction table.
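
A sketch of the reduction-table idea (my own illustration of the suggestion above, not code from either branch): precompute a fixed-point reciprocal for every denominator in 1..=255 and replace each division with a multiply and a shift.

const SHIFT: u32 = 24;

/// table[d] holds ceil(2^SHIFT / d); index 0 is unused.
fn build_reciprocal_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    for d in 1..=255u32 {
        table[d as usize] = ((1u32 << SHIFT) + d - 1) / d;
    }
    table
}

/// Exact floor division x / d for x <= 65_535 and d in 1..=255: the rounding
/// error satisfies x * (d - 1) < 2^SHIFT, so the final shift discards it.
fn div_by_table(x: u32, d: u8, table: &[u32; 256]) -> u32 {
    debug_assert!(d != 0 && x <= u16::MAX as u32);
    (((x as u64) * (table[d as usize] as u64)) >> SHIFT) as u32
}

Because the colour channels of a blended pixel are all divided by the same output alpha, one table read serves all of those divisions, which is the amortization being described above. Whether this actually beats the hardware divider needs benchmarking, as noted.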

awxkee · Dec 18 '24

It's also possible to add a special branch for power-of-two denominators, for even faster division:

/// True if n is a non-zero power of two.
fn is_power_of_two(n: u32) -> bool {
    n != 0 && (n & (n - 1)) == 0
}

/// For a power of two n, returns k such that n == 1 << k.
fn power_of_two(n: u32) -> u32 {
    n.trailing_zeros()
}

And then perform value >> power_of_two(n) instead of a full division.
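
A hypothetical way to wire the two helpers above into a single fast path; note that the standard library already provides u32::is_power_of_two() for the same check:

/// Hypothetical combined division fast path; d must be non-zero.
fn fast_div(value: u32, d: u32) -> u32 {
    if is_power_of_two(d) {
        value >> power_of_two(d)
    } else {
        value / d // or a reciprocal-table division, as sketched earlier
    }
}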

awxkee · Dec 18 '24

Here is a table to try out: https://gist.github.com/awxkee/b8df92e4f97346fb56ae91dc4dcca779 You can benchmark whether it's worth it on your CPU; I'm actually not sure that it'll be better.

awxkee · Dec 18 '24

Thank you! I'll benchmark that and dig deeper into the performance of these things once we actually have a working alpha blending routine. Right now I'm not even sure that u16 is enough for alpha blending to work, so I don't want to over-invest in optimizing something I might throw away later.

Shnatsel · Dec 18 '24

Okay, I checked how libwebp does it, and they actually do it in u32 rather than u16:

https://github.com/webmproject/libwebp/blob/e4f7a9f0c7c9fbfae1568bc7fa5c94b989b50872/src/demux/anim_decode.c#L215-L267

We should probably just port that.
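
For context, this is the shape of the computation being ported: the standard "over" operator for straight (non-premultiplied) alpha. The sketch below is my own rendering of that formula, not a line-for-line port of the linked C; it keeps everything scaled by 255 until the final division, which is why intermediates can reach 255^3 (about 16.6 million) and u16 is not enough.

/// Composite a straight-alpha RGBA src pixel over a straight-alpha RGBA dst pixel.
fn blend_pixel_nonpremultiplied(src: [u8; 4], dst: [u8; 4]) -> [u8; 4] {
    let sa = src[3] as u32;
    let da = dst[3] as u32;

    // Output alpha scaled by 255: blend_a / 255 is the final alpha byte.
    let blend_a = 255 * sa + da * (255 - sa);
    if blend_a == 0 {
        return [0, 0, 0, 0];
    }

    let blend_channel = |s: u8, d: u8| {
        // The numerator peaks at 255^3 = 16_581_375, hence the u32 arithmetic.
        ((s as u32 * sa * 255 + d as u32 * da * (255 - sa)) / blend_a) as u8
    };

    [
        blend_channel(src[0], dst[0]),
        blend_channel(src[1], dst[1]),
        blend_channel(src[2], dst[2]),
        (blend_a / 255) as u8,
    ]
}

The divisions by blend_a and by 255 are exactly the ones the rest of this thread is about speeding up or approximating.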

Shnatsel · Dec 18 '24

I've ported the libwebp algorithm. It is really inaccurate at low alpha levels, but nobody is going to notice that anyway. It gives an 8% end-to-end performance boost on this sample.

https://github.com/Shnatsel/image-webp/blob/e9a8a6710db5d4af193312ad4983681068bc5534/src/alpha_blending.rs#L5-L52

My conversion of the f64 algorithm using u16 is more accurate than the libwebp blending routine, but is also a lot slower than even f64. However, the loop it's in is really not conducive to vectorization. If I can get the compiler to vectorize it and process 8 values at once via vector instructions, that might change.

Realistically though, the libwebp blending is still going to be faster, if we're willing to take the accuracy hit at very low alpha values.

Shnatsel · Dec 18 '24

You could perhaps replace their 'approximate division by 255' with an exact division? I think that is the main source of inaccuracy. A reduction-table division should also work nicely here, because you could perform all three accurate divisions with a single table read; I'm not sure about the speed, though, that really needs to be benchmarked.

awxkee · Dec 18 '24

I turned an assert! into a debug_assert!, and that must have unlocked some huge optimizations, because decoding is now 16% faster end-to-end; the alpha blending function itself must be ~5x faster.

https://github.com/Shnatsel/image-webp/blob/c0cfcb1d2afbb581e8a5b25614d6f9cecfb497d5/src/alpha_blending.rs#L6-L59

I also re-added support for big-endian targets. C doesn't have a [u8; 4] type, so libwebp has to do cursed endianness-dependent bit shifts instead.
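
A small illustration of that endianness point (my own example, not code from the branch): in Rust a packed pixel can move between u32 and [u8; 4] with an explicit byte order, so no compile-time endianness branches are needed.

/// Unpack a pixel stored as a little-endian u32 into its four channel bytes.
fn unpack_pixel(pixel: u32) -> [u8; 4] {
    pixel.to_le_bytes()
}

/// Pack four channel bytes back into a little-endian u32.
fn pack_pixel(bytes: [u8; 4]) -> u32 {
    u32::from_le_bytes(bytes)
}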

Shnatsel · Dec 18 '24

@awxkee I've replaced libwebp's division approximation with your div_by_255 and got improved precision without sacrificing performance!

Combined with the image_webp::vp8::Frame::fill_rgba optimization in #122, we're now 27% faster end-to-end on this image compared to where we started.

Shnatsel · Dec 19 '24

Actually, if you're using u32 instead of u16, there is a shorter sequence for division by 255: (a * 32897) >> 23. The u16 method makes more sense for SIMD vectors with u16 lanes.

awxkee · Dec 19 '24

That method is a less precise approximation of the floating-point division: I'm seeing a greater divergence from the floating-point reference. I believe the trick with the other div_by_255 is that it rounds to nearest instead of always rounding down.
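
A quick way to see the difference (both helpers below are assumed forms, written for illustration): the shift-only variant computes a floor, while the (v + 128) variant rounds to nearest.

/// Floor division by 255: equals a / 255 rounded down, for a <= 65_535.
fn div255_floor(a: u32) -> u32 {
    (a * 32897) >> 23
}

/// Rounding division by 255: equals round(a / 255.0), for a <= 65_025.
fn div255_round(a: u32) -> u32 {
    let v = a + 128;
    (v + (v >> 8)) >> 8
}

fn main() {
    // 200 / 255 = 0.78...: the floor variant gives 0, the rounding one gives 1.
    assert_eq!(div255_floor(200), 0);
    assert_eq!(div255_round(200), 1);
}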

Shnatsel · Dec 19 '24

I've dumped the frames from this image and found that the alpha channel only has two states: fully transparent and fully opaque. It makes sense for animated images to be encoded this way, if you think about it: you can keep the unchanged pixels from the previous frame and only encode new ones. That's what GIF does too.

Despite that, we're still doing the full, expensive alpha blending for each frame, even though we could get away with simply selecting the pixel from either the previous or the current frame depending on the alpha value. We don't even need to run an expensive scan of the entire alpha channel to detect this case - it's encoded as indexed data! We can cheaply inspect the alpha channel's index table, realize that every pixel is either fully transparent or fully opaque, and just run a cmov in a loop instead of complex math.
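
A sketch of that fast path (hypothetical, not code from the crate): once every alpha value is known to be 0 or 255, compositing degenerates into a per-pixel select, which the compiler can lower to conditional moves or masked vector stores.

/// Composite a frame over the canvas when every alpha is either 0 or 255:
/// keep the canvas pixel where the frame is transparent, copy the frame pixel otherwise.
fn composite_binary_alpha(frame: &[[u8; 4]], canvas: &mut [[u8; 4]]) {
    for (src, dst) in frame.iter().zip(canvas.iter_mut()) {
        if src[3] != 0 {
            *dst = *src;
        }
    }
}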

Alpha blending accounts for 11% of the decoding time on this image as of image-webp v0.2.1 (https://share.firefox.dev/3Sx44IL). This change would become much more impactful if decoding of the frames were parallelized as per #120.

Shnatsel · Jun 02 '25