Sergey "Shnatsel" Davidoff
I think I've overcomplicated parallelizing animations. **Having just two threads - one for decoding, one for compositing - is going to be almost as good as it's going to get.**...
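For illustration, here is a minimal sketch of that two-thread layout using only the standard library. Everything in it is hypothetical: `Frame`, `decode_frame`, `composite`, and `run_pipeline` are stand-ins made up for this example, not image-webp APIs, and the "decoding" and "compositing" are placeholders.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for one decoded animation frame (a single RGBA pixel).
pub struct Frame {
    pub pixels: Vec<u8>,
}

/// Placeholder decoder: real code would decode frame `index` from the file.
fn decode_frame(index: usize) -> Frame {
    Frame { pixels: vec![index as u8; 4] }
}

/// Placeholder compositor: real code would alpha-blend the frame onto the canvas.
fn composite(canvas: &mut [u8], frame: &Frame) {
    canvas.copy_from_slice(&frame.pixels);
}

/// Decodes on a worker thread and composites on the calling thread.
/// Returns a snapshot of the canvas after each frame.
pub fn run_pipeline(frame_count: usize) -> Vec<Vec<u8>> {
    // A small bounded channel keeps the decoder at most a couple of frames ahead.
    let (tx, rx) = mpsc::sync_channel::<Frame>(2);

    let decoder = thread::spawn(move || {
        for i in 0..frame_count {
            // Stop early if the compositing side has gone away.
            if tx.send(decode_frame(i)).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the receiver's iteration.
    });

    let mut canvas = vec![0u8; 4];
    let mut outputs = Vec::with_capacity(frame_count);
    for frame in rx {
        // Compositing is sequential: each frame depends on the previous canvas.
        composite(&mut canvas, &frame);
        outputs.push(canvas.clone());
    }

    decoder.join().expect("decoder thread panicked");
    outputs
}
```

The bounded channel is a design choice for this sketch, so memory use stays flat even when decoding runs ahead of compositing.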
I looked at the profile and the associated code a bit. The two low-hanging optimization opportunities are: 1. Applying the YUV->RGB optimization from #13 to the [RGBA codepath](https://github.com/image-rs/image-webp/blob/ecead22637f625a144830aff6c05b02d185a5d00/src/vp8.rs#L921-L944) as well...
Paper on fast alpha blending without divisions: https://arxiv.org/pdf/2202.02864
I've attempted to optimize alpha blending by performing it in u16 instead of f64. I got the primitives working (rounding integer division both by 255 and by an arbitrary u8)...
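To make the two primitives concrete, here is a rough sketch of what rounding division in `u16` can look like. The function names are made up for this example and this is not the code from my branch; the divide-by-255 part uses the well-known shift-and-add identity, which is only valid for inputs up to 255 * 255.

```rust
/// Rounded division by 255 without a division instruction.
/// Only valid for `x <= 255 * 255` (the product of two u8 values).
pub fn div_round_255(x: u16) -> u16 {
    debug_assert!(x <= 255 * 255);
    // Cannot overflow: 65025 + 128 + 254 < 65536.
    let t = x + 128;
    (t + (t >> 8)) >> 8
}

/// Rounded division by an arbitrary non-zero u8 (ties round up).
/// Assumes `n <= 255 * 255` so that `n + d / 2` cannot overflow a u16.
pub fn div_round(n: u16, d: u8) -> u16 {
    debug_assert!(d != 0);
    debug_assert!(n <= 255 * 255);
    let d = u16::from(d);
    (n + d / 2) / d
}
```

For example, `div_round_255(128)` is 1 and `div_round_255(127)` is 0, matching 128/255 ≈ 0.502 and 127/255 ≈ 0.498.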
Thank you! I'll benchmark that and dig deeper into the performance of these things once we actually have a working alpha blending routine. Right now I'm not even sure if...
Okay, I checked how libwebp does it, and they actually do it in `u32` rather than `u16`: https://github.com/webmproject/libwebp/blob/e4f7a9f0c7c9fbfae1568bc7fa5c94b989b50872/src/demux/anim_decode.c#L215-L267 We should probably just port that.
I've ported the libwebp algorithm. It is really inaccurate at low alpha levels but nobody is going to notice that anyway. It gives an 8% end-to-end performance boost on this...
I turned an `assert!` into a `debug_assert!` and that must have unlocked some huge optimizations because decoding is now 16% faster end-to-end, so the alpha blending function must be ~5x...
@awxkee I've replaced libwebp's division approximation with your `div_by_255` and got improved precision without sacrificing performance! Combined with the `image_webp::vp8::Frame::fill_rgba` optimization in #122, we're now 27% faster end-to-end on this...
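For anyone following along, here is a simplified sketch of the general shape of integer "source over" blending for non-premultiplied RGBA with an exact divide-by-255. It is not the code that landed: `blend_over` and this `div_by_255` are written for this example only, and the per-channel division by the output alpha is kept as a plain rounded integer division.

```rust
/// Rounded division by 255; only valid for `x <= 255 * 255`.
/// Written for this sketch, not copied from any crate.
fn div_by_255(x: u32) -> u32 {
    debug_assert!(x <= 255 * 255);
    let t = x + 128;
    (t + (t >> 8)) >> 8
}

/// Blends non-premultiplied RGBA `src` over `dst` using integer math only.
pub fn blend_over(src: [u8; 4], dst: [u8; 4]) -> [u8; 4] {
    let src_a = u32::from(src[3]);
    // Fast paths: a transparent source changes nothing, an opaque one replaces.
    if src_a == 0 {
        return dst;
    }
    if src_a == 255 {
        return src;
    }

    let dst_a = u32::from(dst[3]);
    // Share of the destination still visible through the source, in 0..=255.
    let dst_factor = div_by_255(dst_a * (255 - src_a));
    // At least 1 because src_a >= 1, and at most 255.
    let out_a = src_a + dst_factor;

    let mut out = [0u8; 4];
    for i in 0..3 {
        let num = u32::from(src[i]) * src_a + u32::from(dst[i]) * dst_factor;
        // num <= 255 * out_a, so the rounded quotient always fits in a u8.
        out[i] = ((num + out_a / 2) / out_a) as u8;
    }
    out[3] = out_a as u8;
    out
}
```

With this sketch, blending `[255, 0, 0, 128]` over an opaque `[0, 0, 255, 255]` gives `[128, 0, 127, 255]`.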
That method results in a less precise approximation of the floating-point division, and I'm seeing a greater divergence from the floating-point reference. I believe the trick with the other `div_by_255`...