Decoding animated WebP is 4x slower than `libwebp-sys` + `webp-animation`
In image v0.25.4 and image-webp v0.2.0, decoding the attached animated WebP is 4x slower than using libwebp-sys + webp-animation: sample.zip
image
```rust
use std::error::Error;
use image::{codecs::webp::WebPDecoder, AnimationDecoder, ImageReader};

fn main() -> Result<(), Box<dyn Error>> {
    let input = std::env::args().nth(1).unwrap();
    let reader = ImageReader::open(input)?.into_inner();
    let decoder = WebPDecoder::new(reader)?;
    let mut iter = decoder.into_frames();
    while let Some(_frame) = iter.next() {}
    Ok(())
}
```
hyperfine results:
```
  Time (mean ± σ):     411.9 ms ±   3.9 ms    [User: 372.2 ms, System: 38.8 ms]
  Range (min … max):   407.6 ms … 416.8 ms    10 runs
```
libwebp-sys + webp-animation
```rust
use std::error::Error;
use webp_animation::prelude::*;

fn main() -> Result<(), Box<dyn Error>> {
    let input = std::env::args().nth(1).unwrap();
    let buffer = std::fs::read(input)?;
    let decoder = Decoder::new(&buffer)?;
    let mut iter = decoder.into_iter();
    while let Some(_frame) = iter.next() {}
    Ok(())
}
```
hyperfine results:
```
  Time (mean ± σ):      95.9 ms ±   0.4 ms    [User: 128.7 ms, System: 7.4 ms]
  Range (min … max):    95.3 ms …  96.7 ms    30 runs
```
Analysis
webp-animation shows a bit of multi-threading in the profile (user time exceeds total wall-clock time), but even accounting for that, image-webp is 3x slower.
Breakdown of where the time is spent in image, recorded by samply: https://share.firefox.dev/4fc3utg
The greatest contributors seem to be `image_webp::vp8::Vp8Decoder::decode_frame` (48%), `image_webp::extended::do_alpha_blending` (20%), and `image_webp::vp8::Frame::fill_rgba` (16%).
Within `decode_frame` the biggest contributor is `image_webp::vp8::Vp8Decoder::read_coefficients` (12% self time, 32% total time), and the code of that function looks like it could be optimized further to reduce bounds checks, etc. #71 is also relevant, but only accounts for 20% of the total time.
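As a generic illustration of the kind of bounds-check reduction hinted at above (hypothetical sketch, not the actual `read_coefficients` code): converting a slice prefix into a fixed-size array performs one length check up front, after which constant-index accesses compile without per-access bounds checks.

```rust
// Hypothetical sketch of a bounds-check-elision pattern, not code from
// image-webp: one length check up front via try_into, then the constant
// indices below need no further checks.
fn sum_first_four(bytes: &[u8]) -> u32 {
    let arr: [u8; 4] = bytes[..4].try_into().unwrap(); // single check here
    arr[0] as u32 + arr[1] as u32 + arr[2] as u32 + arr[3] as u32
}
```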
I looked at the profile and the associated code a bit. The two low-hanging optimization opportunities are:
- Applying the YUV->RGB optimization from #13 to the RGBA codepath as well
- The alpha blending currently uses an `f64` to blend every `u8`. No wonder that's slow - it kills vectorization. We should make it operate in integers (`u16` might be enough?) and use `chunks_exact()` to get the compiler to vectorize it.
Paper on fast alpha blending without divisions: https://arxiv.org/pdf/2202.02864
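The integer-blending idea from the list above can be sketched under a simplifying assumption (hypothetical code, not what image-webp does): if the destination pixel is fully opaque, blending reduces to a per-channel lerp whose intermediates all fit in `u16`, in a shape `chunks_exact` lets the compiler vectorize.

```rust
// Sketch: blend an RGBA8 frame onto a fully opaque canvas in pure u16 math.
// Assumes RGBA8 layout and an opaque destination; both are assumptions for
// illustration, not image-webp's actual internals.
fn blend_onto_opaque(canvas: &mut [u8], frame: &[u8]) {
    for (dst, src) in canvas.chunks_exact_mut(4).zip(frame.chunks_exact(4)) {
        let a = src[3] as u16;
        for i in 0..3 {
            let s = src[i] as u16;
            let d = dst[i] as u16;
            // rounded (s*a + d*(255-a)) / 255; max intermediate is 65152 < u16::MAX
            dst[i] = ((s * a + d * (255 - a) + 127) / 255) as u8;
        }
        // destination is assumed opaque, so alpha stays 255
    }
}
```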
Here is a ready-made method if needed: https://github.com/awxkee/pic-scale-safe/blob/7f1066e006aac5eea4c1474f5bbe81cae837b2b7/src/alpha.rs#L43
Division by alpha, if required, is much less straightforward.
I've attempted to optimize alpha blending by performing it in u16 instead of f64. I got the primitives working (rounding integer division both by 255 and by an arbitrary u8) but when I assemble the entire thing with them it falls apart. Pretty early, too - calculating the resulting alpha is already wrong, we don't even get to the interesting part of blending the RGB values.
This is what I've got so far: https://github.com/image-rs/image-webp/compare/main...Shnatsel:image-webp:faster-alpha-blending?expand=1
By the way, @awxkee in your `div_by_255` the `.min(255)` part is completely unnecessary. It never outputs 256 in the first place, so you don't need any clamping. My branch includes an exhaustive correctness test for that function, so you can try it for yourself. I kept the `.min` in my branch out of an abundance of caution, but I am convinced that it does nothing.
`min` was accidentally added when I was trying to force auto-vectorization. You're correct that the division is exact, so clamping is unnecessary.
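For reference, a rounding division by 255 of the kind discussed here can be written with only adds and shifts, and an exhaustive check confirms it never exceeds 255, so no clamp is needed. This is a minimal sketch in my own naming, not necessarily the exact code from the branch:

```rust
// Sketch: x/255 rounded to nearest, for x in 0..=255*255 (a product of two
// u8 values). Equals (x + 127) / 255 for every input in that range, and the
// result never exceeds 255, so no .min(255) clamp is required.
fn div_by_255_round(x: u32) -> u32 {
    let v = x + 128;
    (v + (v >> 8)) >> 8
}
```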
I think your `rounding_div` will be incredibly slow, since it performs two integer divisions at a time.
Maybe it's worth trying to generate a reduction table, since the denominator range is small ([1, 255]), and performing Montgomery multiplication?
That's what LLVM does when dividing by a constant (https://godbolt.org/z/qnE5E8Eo3): it performs Montgomery multiplication in the form `b = (a * R) >> k`.
Additionally, the two divisions share the same denominator, so each step will require only one read from the reduction table.
It is also possible to add a special branch for power-of-two denominators for even faster division:

```rust
fn is_power_of_two(n: u32) -> bool {
    n != 0 && (n & (n - 1)) == 0
}

fn power_of_two(n: u32) -> u32 {
    n.trailing_zeros()
}
```

And then perform `value >> power_of_two(n)`.
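For what it's worth, Rust's standard library already provides both pieces, so a fast-path division helper along these lines can be sketched without hand-rolled bit tricks (my sketch, not code from the thread's branches):

```rust
// Sketch: divide by n, taking a single-shift fast path when n is a power of
// two. u32::is_power_of_two and u32::trailing_zeros are std methods.
fn div_fast(value: u32, n: u32) -> u32 {
    if n.is_power_of_two() {
        value >> n.trailing_zeros() // exact: shifting right by log2(n)
    } else {
        value / n // fall back to ordinary division
    }
}
```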
Here is a table to try out: https://gist.github.com/awxkee/b8df92e4f97346fb56ae91dc4dcca779 You can benchmark whether it's worth it on your CPU, because I'm actually not sure it'll be better.
Thank you! I'll benchmark that and dig deeper into the performance of these things once we actually have a working alpha blending routine. Right now I'm not even sure whether u16 is enough for alpha blending to work, so I don't want to over-invest in optimizing something I might throw away later.
Okay, I checked how libwebp does it, and they actually do it in u32 rather than u16:
https://github.com/webmproject/libwebp/blob/e4f7a9f0c7c9fbfae1568bc7fa5c94b989b50872/src/demux/anim_decode.c#L215-L267
We should probably just port that.
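For orientation, this is not a verbatim port of the libwebp routine, but a sketch of the textbook non-premultiplied "over" operator in pure u32 integer math that such a port needs to approximate. The function name and RGBA channel order are my assumptions:

```rust
// Sketch of the exact non-premultiplied "over" operator in u32 math
// (hypothetical reference, not libwebp's approximation):
//   out_a = a_s + a_d*(1 - a_s)
//   out_c = (c_s*a_s + c_d*a_d*(1 - a_s)) / out_a
// with all values in 0..=255 fixed point. Max intermediate is ~33M, so u32
// is comfortably enough (u16 is not).
fn blend_pixel(src: [u8; 4], dst: [u8; 4]) -> [u8; 4] {
    let [sr, sg, sb, sa] = src.map(u32::from);
    let [dr, dg, db, da] = dst.map(u32::from);
    // t = 255 * exact output alpha
    let t = sa * 255 + da * (255 - sa);
    if t == 0 {
        return [0, 0, 0, 0]; // fully transparent result
    }
    let oa = (t + 127) / 255; // rounded output alpha
    let blend = |s: u32, d: u32| -> u8 {
        let num = s * sa * 255 + d * da * (255 - sa);
        // dividing by the exact t (not the rounded oa) keeps the result <= 255
        ((num + t / 2) / t) as u8
    };
    [blend(sr, dr), blend(sg, dg), blend(sb, db), oa as u8]
}
```

Dividing the channels by the exact `t` rather than the rounded output alpha sidesteps the off-by-one overflow that rounding the alpha first can introduce.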
I've ported the libwebp algorithm. It is really inaccurate at low alpha levels, but nobody is going to notice that anyway. It gives an 8% end-to-end performance boost on this sample.
https://github.com/Shnatsel/image-webp/blob/e9a8a6710db5d4af193312ad4983681068bc5534/src/alpha_blending.rs#L5-L52
My conversion of the f64 algorithm using u16 is more accurate than the libwebp blending routine, but is also a lot slower than even f64. However, the loop it's in is really not conducive to vectorization. If I can get the compiler to vectorize it and process 8 values at once via vector instructions, that might change.
Realistically though, the libwebp blending is still going to be faster, if we're willing to take the accuracy hit at very low alpha values.
You could perhaps replace their "approximate division by 255" with exact division? I think that is the main source of inaccuracy. Division by a reduction table should also work nicely here, because you can perform 3 accurate divisions with one table read; however, I'm not sure about the speed then, this really needs to be checked.
I turned an `assert!` into a `debug_assert!`, and that must have unlocked some huge optimizations, because decoding is now 16% faster end-to-end, so the alpha blending function itself must be ~5x faster.
https://github.com/Shnatsel/image-webp/blob/c0cfcb1d2afbb581e8a5b25614d6f9cecfb497d5/src/alpha_blending.rs#L6-L59
I also re-added support for big endian. C doesn't have a `[u8; 4]` type, so they have to do cursed endianness-dependent bit shifts instead.
@awxkee I've replaced libwebp's division approximation with your div_by_255 and got improved precision without sacrificing performance!
Combined with the `image_webp::vp8::Frame::fill_rgba` optimization in #122, we're now 27% faster end-to-end on this image compared to where we started.
Actually, if you're using u32 instead of u16, there is a shorter route for division by 255: `(a * 32897) >> 23`.
The u16 method makes more sense for SIMD vectors with u16 lanes.
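A quick exhaustive check of that constant (my sketch): over the full u16 input range the magic multiply matches truncating division by 255 exactly; note that it truncates rather than rounds to nearest, e.g. it maps 128 to 0.

```rust
// Sketch: magic-multiply division by 255. 32897 = ceil(2^23 / 255), and the
// shift makes this an exact *truncating* division for a in 0..=65535.
// Unlike a round-to-nearest variant, it always rounds down.
fn div_by_255_trunc(a: u32) -> u32 {
    (a * 32897) >> 23
}
```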
That method results in a less precise approximation of the floating-point division, and I'm seeing a greater divergence from the floating-point reference. I believe the trick with the other div_by_255 is that it also rounds to nearest, instead of always rounding down.
I've dumped the frames from this image and found that the alpha channel only has two states: fully transparent and fully opaque. It makes sense for animated images to be encoded this way, if you think about it: you can keep the unchanged pixels from the previous frame and only encode new ones. That's what GIF does too.
Despite that, we're still doing the proper expensive alpha blending for each frame, even though we could get away with simply selecting the pixel from previous or current frame depending on the alpha channel value. We don't even need to run an expensive scan of the entire alpha channel - it's encoded as indexed data! We can cheaply inspect the alpha channel's index table, realize that each pixel is either fully transparent or fully opaque, and just run a cmov in a loop instead of complex math.
Alpha blending accounts for 11% of the decoding time on this image as of image-webp v0.2.1: https://share.firefox.dev/3Sx44IL. This change would become much more impactful if decoding the frames were parallelized as per #120.
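The select-based fast path described above could look roughly like this (hypothetical sketch; assumes RGBA8 buffers, which may not match the decoder's internal layout, and assumes the binary-alpha property has already been verified from the index table):

```rust
// Sketch: when every alpha value in the frame is known to be 0 or 255,
// blending degenerates to a per-pixel select: keep the canvas pixel where
// the frame is transparent, take the frame pixel where it's opaque.
// The compiler can turn this branch into a cmov or vectorized blend.
fn blend_binary_alpha(canvas: &mut [u8], frame: &[u8]) {
    for (dst, src) in canvas.chunks_exact_mut(4).zip(frame.chunks_exact(4)) {
        if src[3] != 0 {
            dst.copy_from_slice(src); // opaque: take the new pixel
        }
        // transparent: keep the previous frame's pixel (dst unchanged)
    }
}
```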