
Vectorize color transform

Open fintelia opened this issue 2 months ago • 4 comments

From my benchmarking, the color transform takes an average of 8.5% of the total decode time for lossless images. On the same set of images, libwebp spends 1.5% of total decode time thanks to its SSE2 vectorized implementation. This difference, plus the time spent in the (much simpler) subtract green transform, accounts for nearly all the performance disparity between this crate and libwebp.
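
For context, the inverse that the decoder actually applies just adds green back to the other two channels; a minimal scalar sketch (illustrative, not this crate's actual code):

```rust
// Inverse subtract green transform: add each pixel's green value back to
// red and blue, wrapping on overflow as the format requires.
fn add_green_scalar(rgba: &mut [u8]) {
    for px in rgba.chunks_exact_mut(4) {
        let g = px[1];
        px[0] = px[0].wrapping_add(g); // red += green
        px[2] = px[2].wrapping_add(g); // blue += green
    }
}
```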

fintelia avatar Oct 17 '25 06:10 fintelia

I implemented the transform in SSE2 intrinsics (https://github.com/okaneco/image-webp/commit/42b4a1f7d6d6ca5921e0336ce3bd41ebcb84a70c), and the functionality is behind a feature so users can opt in to the "unsafe" optimization.

On the benchmark it looks like a 3-4x improvement, though the exact amount was a little noisy. I'm unsure of the end-to-end impact.

// cargo bench --features=_benchmarks -- subtract_green
test lossless_transform::benches::subtract_green ... bench: 447.32 ns/iter (+/- 13.61) = 9163 MB/s
// cargo bench --features=_benchmarks,sse_simd -- subtract_green
test lossless_transform::benches::subtract_green ... bench: 131.91 ns/iter  (+/- 8.20) = 31267 MB/s
test lossless_transform::benches::subtract_green ... bench: 115.55 ns/iter  (+/- 3.65) = 35617 MB/s

bytemuck is used to convert between arrays and register types: building the registers by hand is also safe, but the bytemuck casts are less verbose. Unsafe is needed to call anything annotated with a target feature, even when that feature is part of the target's baseline, as sse2 is on x86_64. This ended up with 2 calls using unsafe.
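
The shape of it is roughly this (a sketch of the pattern, not the committed code; it assumes bytemuck's Pod impls for the core::arch SIMD types):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_add_epi8};

// bytemuck::cast is a safe call here because [u8; 16] and __m128i are both
// Pod and the same size; only calling the #[target_feature] fn is unsafe.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn add_bytes_sse2(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    let va: __m128i = bytemuck::cast(a);
    let vb: __m128i = bytemuck::cast(b);
    bytemuck::cast(_mm_add_epi8(va, vb))
}

#[cfg(target_arch = "x86_64")]
fn add_bytes(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    // SAFETY: sse2 is part of the x86_64 baseline, so the required target
    // feature is always available on this architecture.
    unsafe { add_bytes_sse2(a, b) }
}
```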

okaneco avatar Oct 19 '25 18:10 okaneco

Thanks for looking into this! To be clear, there are two transforms that are relevant here: the color transform (8.5% of total time) and the subtract green transform (2.2% of total time).

I tested your SSE2 subtract green implementation as well as a (nightly only) std::simd implementation I had from before, and both drop the time spent to 0.9% of the total. That's a ~1% difference in end-to-end performance. libwebp's implementation is actually a little faster at only 0.5% of the total time, probably because they make cleverer use of shuffles.

It is also possible to vectorize the color transform (libwebp's version is here), which looks like it should have an even bigger impact.
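
For reference, the per-pixel inverse as the lossless spec describes it looks roughly like this (a sketch with illustrative names; the multipliers are signed fixed-point bytes stored per block). Note that the new blue depends on the already-transformed red, which is part of what makes this harder to vectorize than subtract green:

```rust
// Spec's ColorTransformDelta: signed 8-bit multiply, arithmetic shift by 5,
// truncated to the low byte.
fn color_transform_delta(t: i8, c: i8) -> u8 {
    ((t as i32 * c as i32) >> 5) as u8
}

// Inverse color transform for one pixel's red and blue channels.
fn inverse_color_transform(r: u8, g: u8, b: u8, g_to_r: u8, g_to_b: u8, r_to_b: u8) -> (u8, u8) {
    let new_r = r.wrapping_add(color_transform_delta(g_to_r as i8, g as i8));
    let new_b = b
        .wrapping_add(color_transform_delta(g_to_b as i8, g as i8))
        .wrapping_add(color_transform_delta(r_to_b as i8, new_r as i8));
    (new_r, new_b)
}
```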

I guess the question is what the best path forward is. If we knew portable_simd was going to be stable within a few months, then that would be the natural choice. But is it worth having it behind a nightly-only feature indefinitely? Going with platform-specific SIMD would mean losing the current forbid(unsafe_code), at least for some indefinite amount of time until the standard library gains a way of safely doing what we want. And having truly zero unsafe code is valuable, since it's an easy line to draw, and if you're OK with well-audited but unsafe code then there's already libwebp.

fintelia avatar Oct 20 '25 01:10 fintelia

> To be clear, there are two transforms that are relevant here: the color transform (8.5% of total time) and the subtract green transform (2.2% of total time).

I misunderstood that from the original message. Agreed on vectorizing the color transform; I might look into doing that for fun even if it won't get merged.

Are you using your corpus-bench repo for these benchmarks? It'd be helpful to know the process.


I think there's a case to be made for using platform intrinsics but I'd be fine if this crate stayed 100% safe.

If 100% safe is chosen, then I think it's worth exploring portable_simd to see what performance the crate is capable of while waiting out stabilization.
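
For a sense of what that looks like, here's add-green over four RGBA pixels against the current nightly std::simd API (a sketch; the API may still change before stabilization):

```rust
#![feature(portable_simd)]
use std::simd::{simd_swizzle, u8x16};

// Duplicate each pixel's green byte (index 1 of every 4) across the pixel,
// mask it so it only lands on red and blue, and add. Simd integer addition
// wraps, matching the scalar wrapping_add.
fn add_green(px: u8x16) -> u8x16 {
    let g = simd_swizzle!(px, [1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9, 13, 13, 13, 13]);
    let mask = u8x16::from_array([
        0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0,
    ]);
    px + (g & mask)
}
```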

There are 3 points in your last paragraph that I want to pull out for easier discussion:
  1. Using portable_simd when its stabilization timeline is uncertain
  2. Losing forbid(unsafe_code) until we get some safe std way of calling platform intrinsics
  3. The value of having 0 lines of unsafe code in a crate

1 - portable_simd still seems very far away if you check the recent status of the repo. Work has picked up again, but I think they're reworking how masks are handled (which might require more work upstream; see rust-lang/portable-simd/issues/472 for discussion). They just landed changes in the compiler and in the repo to remove LaneCount, so you don't have to write `where LaneCount<N>: SupportedLaneCount` bounds everywhere. Beyond the masks, the SIMD traits are being rethought (there's discussion in the Zulip but nothing conclusive), and they need to be finalized before an RFC can propose adding any of this to stable Rust.

This is roughly how I understand the roadmap to stabilization.

2 - I want this too, but there isn't even a sketch for it yet, so it seems too far off for me to consider as a realistic option. Ideally, there would be a feature detection function that returns a token which lets us safely call into target-feature functions, or the availability could be determined at compile time for the target; a rough sketch of the token idea is below. The relevant work I'm aware of is these RFCs (the second has been implemented as an experimental feature):
  - rust-lang/rfcs/pull/3525 - Struct target features RFC
  - rust-lang/rfcs/pull/3820 - RFC: Changes to target_feature attribute
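
Something like this is what I have in mind (purely hypothetical; no such std API exists today, and the real thing would need compiler support to remove the remaining unsafe):

```rust
// Hypothetical "proof token" pattern: a zero-sized token can only be
// minted after a successful runtime feature check, so holding one is
// evidence that the feature is available.
#[cfg(target_arch = "x86_64")]
#[derive(Clone, Copy)]
pub struct Sse41Token(());

#[cfg(target_arch = "x86_64")]
pub fn detect_sse41() -> Option<Sse41Token> {
    if is_x86_feature_detected!("sse4.1") {
        Some(Sse41Token(()))
    } else {
        None
    }
}
```

Today a callee would still need an unsafe block to actually call the #[target_feature] function; the point of the RFCs is to let the compiler accept that call safely given such a proof.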

3 - It's unfortunate to lose forbid(unsafe_code) across the board, but with cfg_attr we can still forbid(unsafe_code) for default usage of the crate, keeping the crate 100% safe in that configuration.
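
Concretely, that's one line at the crate root (reusing the sse_simd feature name from my commits):

```rust
// Forbid unsafe code unless the opt-in SIMD feature is enabled.
#![cfg_attr(not(feature = "sse_simd"), forbid(unsafe_code))]
```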

Point 3 is the main one to consider: whether to remain 100% safe or not.

I think platform-intrinsic SIMD can be justified when the limits of auto-vectorization have been reached and there's still performance left on the table. There are different levels of unsafe usage to me, and calling into a target_feature function containing safe SIMD intrinsics carries much lower risk than using raw pointers. The unsafe is more a formality papering over a language defect than raw pointers being used to read, write, or cast between types. There would still be 100% safe code paths in the library, and the optimizations would be opt-in.

For 1, I'd rather see something that works on stable than rely on a nightly feature for an indefinite amount of time. I also think there should be less of a stigma against using intrinsics, because they're much easier to use correctly than unsafe code that deals in raw pointers or other genuinely dangerous operations. They're also a much more reliable way to get performance than the sometimes fragile auto-vectorization.

okaneco avatar Oct 20 '25 04:10 okaneco

There have to be smarter ways to do it, but I implemented the color transform in SSE2: https://github.com/okaneco/image-webp/commit/1999949b9c4716ff27f97359f7f883ac34f3d0e5.

libwebp appears to have chosen BGRA byte order while we use RGBA, so the implementation and performance improvements may differ. I couldn't quite follow their pre-shifting implementation because of the byte order difference, so that may be a future improvement and something to experiment with in the scalar implementation.

~2x improvement on the bench

// cargo bench --features=_benchmarks -- color_transform
test lossless_transform::benches::color_transform ... bench:      75,126.36 ns/iter (+/- 2,505.18) = 3489 MB/s
// cargo bench --features=_benchmarks,sse_simd -- color_transform
test lossless_transform::benches::color_transform ... bench:      37,645.20 ns/iter (+/- 2,809.00) = 6963 MB/s

I updated the green transform to use a less naive shuffle, and that was an improvement, edging slightly closer to the 0.5% time spent that you mention from profiling. Implemented here.
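
The trick is roughly this (a sketch, not the exact committed code):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{
    __m128i, _mm_add_epi8, _mm_shufflehi_epi16, _mm_shufflelo_epi16, _mm_srli_epi16,
};

// Add-green over four RGBA pixels. Green sits at byte 1 of each pixel, so
// shifting every 16-bit lane right by 8 leaves [g, 0, a, 0] per pixel, and
// the shufflelo/shufflehi pair copies g over a to give [g, 0, g, 0].
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn add_green_4px(px: __m128i) -> __m128i {
    let ga = _mm_srli_epi16::<8>(px); // per pixel: [g, 0, a, 0]
    let gg = _mm_shufflelo_epi16::<0b1010_0000>(ga);
    let gg = _mm_shufflehi_epi16::<0b1010_0000>(gg); // per pixel: [g, 0, g, 0]
    _mm_add_epi8(px, gg) // bytewise wrapping add: r+g, g, b+g, a
}
```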

Previous sse2 performance for this was ~120-130 ns/iter.

// cargo bench --features=_benchmarks -- subtract_green
test lossless_transform::benches::subtract_green ... bench:         439.05 ns/iter (+/- 25.93) = 9330 MB/s
// cargo bench --features=_benchmarks,sse_simd -- subtract_green
test lossless_transform::benches::subtract_green ... bench:          95.81 ns/iter (+/- 6.89) = 43115 MB/s

okaneco avatar Oct 27 '25 01:10 okaneco