Bulk processed pixel conversions
Adding an internal API to bulk process pixel conversions, with SIMD/SWAR-friendly per-conversion implementations, would speed up many image ops in this crate. I would like to hear some feedback on the general concept and its applications for the future.
Draft
I'm willing to take a shot at implementing these APIs as far as possible; I have done some bikeshedding and benchmarks, as seen here. I have observed a 3x speedup when converting a 16 MB f32 buffer to u8 while taking these new constraints into account. Generally, most conversion APIs would need to be extended like this, though the new API would not be mandatory to implement, since a scalar fallback should always be possible.
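To make the shape of this concrete, here is a rough sketch of the kind of default'able extension method on an internal trait that I have in mind (all names here are placeholders, not a final API):

```rust
use image::Primitive;

/// Hypothetical internal trait: bulk channel conversion with a scalar default,
/// so implementing a fast path stays optional per conversion pair.
trait FromPrimitiveBulk<T: Primitive>: Primitive {
    /// Per-value conversion, mirroring the existing scalar API.
    fn from_primitive_scalar(value: T) -> Self;

    /// Default scalar fallback; specific `(T, Self)` pairs can override this
    /// with SIMD/SWAR implementations where it pays off.
    fn from_primitive_bulk(input: &[T], output: &mut [Self]) {
        for (dst, &src) in output.iter_mut().zip(input) {
            *dst = Self::from_primitive_scalar(src);
        }
    }
}
```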
I appreciate the extension for minimalism alone; it would be very nice if that default'able extension method on an internal trait worked out to 3× speedups for some of the higher-level operations. In general, yes, anything that can be designed for batching but isn't deserves a critical look.
What's the main hold-up for turning your linked implementation sketch and commit into a PR?
The holdup is calling this method internally, as most APIs only call from_color three to four times (once per channel or alpha) per pixel at a time. Right now I'm attempting to further modify upstream APIs to support calling into these new bulk functions. Once that is complete I'll draft a PR and get feedback on some of the decisions that have to be made.
An example issue would be FromColor as already mentioned:
```rust
fn from_color_bulk(output: &mut [Self], input: &[Rgb<f32>])
where
    Self: Sized,
{
    // Only sound if `Rgb<f32>` is layout-compatible with `[f32; 3]`,
    // i.e. three `f32` channels per pixel and no padding.
    let input: &[f32] = unsafe {
        std::slice::from_raw_parts(input.as_ptr() as *const f32, input.len() * 3)
    };
    // Same assumption for the output pixel type, which here consists of `u8` channels.
    let output: &mut [u8] = unsafe {
        std::slice::from_raw_parts_mut(
            output.as_mut_ptr() as *mut u8,
            output.len() * std::mem::size_of::<Self>(),
        )
    };
    u8::from_bulk_primitive(input, output);
}
```
Here I have to somehow turn a buffer of Rgb<f32> into &[f32] (and the same goes for the output buffer). This current 'solution' is just slapped-together unsafe (assuming `Rgb` is `#[repr(transparent)]`, among other things).
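A slightly more contained version of the same cast would at least spell out and check the layout assumptions it relies on; a sketch (the helper name and checks are hypothetical, not a proposed design):

```rust
use image::Pixel;

/// Hypothetical helper: reinterpret a pixel slice as a flat channel slice.
/// Only sound if `P` is layout-equivalent to an array of its channels
/// (no padding, compatible alignment), which is exactly the assumption above.
fn as_channel_slice<P: Pixel>(pixels: &[P]) -> &[P::Subpixel] {
    let channels = usize::from(P::CHANNEL_COUNT);
    // Cheap runtime checks for the layout assumption; a real implementation
    // would want compile-time guarantees instead.
    assert_eq!(
        std::mem::size_of::<P>(),
        channels * std::mem::size_of::<P::Subpixel>()
    );
    assert!(std::mem::align_of::<P>() >= std::mem::align_of::<P::Subpixel>());
    unsafe {
        std::slice::from_raw_parts(
            pixels.as_ptr() as *const P::Subpixel,
            pixels.len() * channels,
        )
    }
}
```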
I'm supportive of this investigation, though we should make sure to benchmark carefully and to double check that the same performance gains aren't possible with the current interfaces.
I poked around in godbolt a bit with the `from_primitive` method from your PR. It didn't vectorize cleanly as written, but switching from the `round` function to adding `0.5` before casting seems to mostly get it there. There's a subtle difference between the SIMD and non-SIMD round methods that may be the cause: on `f32`, ties round away from zero, while `SimdFloat` rounds them towards zero.
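Roughly the difference I was comparing, as a simplified sketch (these are hypothetical helpers, not the crate's actual code):

```rust
// Variant using `round()`; in my experiments this tends to block autovectorization.
fn quantize_round(x: f32) -> u8 {
    (x.clamp(0.0, 1.0) * 255.0).round() as u8
}

// Variant adding 0.5 and truncating via `as`; this usually lets LLVM vectorize
// the surrounding loop.
fn quantize_add_half(x: f32) -> u8 {
    (x.clamp(0.0, 1.0) * 255.0 + 0.5) as u8
}
```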
Another point to mention is that `std::simd` is currently nightly-only, so we can't use it except behind a non-default feature flag. And even there, we'd want a substantial performance difference to justify the added overhead of dealing with unstable (and potentially changing) library methods.
So far we've avoided ever dealing with &[Rgb<f32>] or similar slices, so I'm not thrilled about changing that. For now using unsafe casts seems fine, but I'd want to see performance numbers/consider alternatives before making a final decision.
This is why I'm still toying around with all of this. For a possible merge I would use portable-simd. I'll try to get this implemented into a working state and present a proper benchmark to compare any possible gains.
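For reference, this is roughly the shape of the portable-simd path I'm experimenting with (a sketch only: lane count, clamping, and remainder handling are all still open, and it is nightly-only behind `#![feature(portable_simd)]`):

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

/// Sketch of a portable-simd f32 -> u8 kernel over flat channel slices.
fn f32_to_u8_simd(input: &[f32], output: &mut [u8]) {
    const LANES: usize = 16;
    assert_eq!(input.len(), output.len());

    let mut in_chunks = input.chunks_exact(LANES);
    let mut out_chunks = output.chunks_exact_mut(LANES);
    for (dst, src) in (&mut out_chunks).zip(&mut in_chunks) {
        let v = Simd::<f32, LANES>::from_slice(src);
        // Scale, round via +0.5, clamp to the valid range, then narrow.
        let scaled = (v * Simd::splat(255.0) + Simd::splat(0.5))
            .simd_clamp(Simd::splat(0.0), Simd::splat(255.0));
        scaled.cast::<u8>().copy_to_slice(dst);
    }

    // Scalar tail for lengths that aren't a multiple of LANES.
    for (dst, &src) in out_chunks.into_remainder().iter_mut().zip(in_chunks.remainder()) {
        *dst = (src * 255.0 + 0.5).clamp(0.0, 255.0) as u8;
    }
}
```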
Some simple preliminary (and unscientific) benchmarks show a significant improvement for a 4096² pixel image (f32 -> u8), which is the motivation for my efforts.
(AVX-512 CPU for native run)
Explicit SIMD will be much better here on x86, since Rust puts a lot of effort into the f32 -> u8 conversion. With NEON the conversion is done in one instruction, so there will be no difference there; the improvements will only show up on x86, where the hardware f32 -> u8 conversion does not saturate and Rust has to make additional checks.
You can see how much time Rust spends checking that the cast won't cause UB on x86 by comparing against:
```rust
#[inline(always)]
fn from_primitive(float: f32) -> u8 {
    unsafe {
        (normalize_float(float, u8::MAX as f32) + 0.5).to_int_unchecked::<u8>()
    }
}
```
On aarch64, explicit SIMD versions will most likely be worse than Rust's own codegen.
The other things in the loop should not add any real overhead.
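For comparison, a sketch of the safe counterpart (using the same `normalize_float` helper as above); the `as` cast saturates out-of-range values and maps NaN to 0, and those extra checks are what show up on x86:

```rust
#[inline(always)]
fn from_primitive_checked(float: f32) -> u8 {
    // Safe variant: `as` clamps to 0..=255 and turns NaN into 0.
    (normalize_float(float, u8::MAX as f32) + 0.5) as u8
}
```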
For anyone who wants to check:
https://godbolt.org/z/YYGqjzasf
https://developer.arm.com/architectures/instruction-sets/intrinsics/vcvtaq_u32_f32
I observed that the compiler can actually emit pretty fast SIMD code for scalar implementations, given that proper slice sizes are used (such as casting Rgb). So it might be enough to just provide scalar bulk ops and let the compiler take the wheel.
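For example, a plain scalar bulk loop over the flattened channel slices (a sketch with a hypothetical helper name) is the kind of thing I observed compiling down to decent SIMD on its own:

```rust
/// Hypothetical scalar bulk conversion over flat channel slices; once the
/// pixel wrappers are cast away, LLVM tends to autovectorize this loop.
fn bulk_f32_to_u8_scalar(input: &[f32], output: &mut [u8]) {
    assert_eq!(input.len(), output.len());
    for (dst, &src) in output.iter_mut().zip(input) {
        *dst = (src.clamp(0.0, 1.0) * u8::MAX as f32 + 0.5) as u8;
    }
}
```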
I believe this should be packaged into three stages of PRs/work that I'll explain below, and I would like to hear some thoughts on them. I'll end each point with the progress I have made in that direction.
- Define the bulk API such that any conversion could use it, but only implement (or rather, keep) the trait-provided default impl (scalar). This change should be ready to PR in less than a week if I find a few free hours.
- Create specialized implementations for specific Pixel types using scalar implementations, applying unsafe where necessary, to enable the compiler to see past the Pixel types when optimizing. Possible in the near future, but this requires manual auditing and additional testing (compared to my slapped-together POC).
- Investigate whether we even need explicit SIMD on individual channel-to-channel conversions (as I have observed the compiler/LLVM producing quite fast SIMD from scalar implementations on its own). Should be very doable once step 2 is available where SIMD makes sense (f32 -> u8/u16), though the question regarding nightly requirements for std::simd needs to be answered.
I think we should start by adding a benchmark of the user-facing functionality that we want to optimize. In this case, I believe that would mean measuring either `ImageBuffer::convert` or one or several of the `DynamicImage::into_XXX` methods. This will serve as a baseline and enable us to evaluate the impact of subsequent changes.
From there, the next step would be a proof-of-concept implementation showing the overall performance difference that is possible. It is also an opportunity to experiment with whether there are other ways to get similar performance gains.
If the results from the previous step are compelling, then we can figure out the right sequence to merge things in. Your proposed sequence of first defining the API, then adding specialized implementations, and following up with further optimizations where beneficial seems broadly reasonable. This is also where we'd have to answer questions about the use of unsafe (ideally we'd avoid it unless absolutely necessary) and about using explicit SIMD.
This can probably be implemented in terms of the internal interfaces of #2531. In fact, that's the internal implementation which I'll gladly benchmark. Providing a slice-to-mut-slice interface on top of that PR looks pretty feasible. I'm only slightly uncertain in terms of interface. From the experience with that PR, it pays off not to write a generic interface in terms of an existing trait here (e.g. `T: Primitive, O: Primitive`), since that effectively disallows selecting more efficient implementations that only work for a (finite) subset of these types. Using `PixelWithColorType` as a private bound, as in that PR, might work, on the other hand.
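To illustrate that interface concern with a rough sketch (not the actual code from that PR): a fully generic free function leaves no room to pick faster paths for specific type pairs on stable Rust, whereas routing through a private trait does:

```rust
use image::Primitive;

// Fully generic signature: every (T, O) pair is forced through the one
// shared implementation, so no pair can get a dedicated fast path.
fn convert_slice_generic<T: Primitive, O: Primitive>(_input: &[T], _output: &mut [O]) {
    // ... single generic implementation ...
}

// Private trait dispatch: concrete pairs can get dedicated implementations,
// while everything else keeps a scalar default.
trait ConvertSlice<O>: Sized {
    fn convert_slice(input: &[Self], output: &mut [O]);
}
```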
Awesome! I will see to it soon and adapt my current implementations to said interface.
Regarding SIMD, also take a look at https://github.com/okaneco/safe_unaligned_simd as an on-stable but more unsafe alternative to std::simd. Some previous experiments with std::simd showed that it left quite a bit of performance on the table due to being too generic as well. (Somewhat similar to the trait interface problem: it can only expose an interface supported by all architectures, which often leaves gaps due to architecture-specific operations, as well as making cycle-timing- and throughput-tuned instruction selection quite opaque, despite that making up a good deal of well-optimized SIMD performance.)