ImageSharp
ImageSharp copied to clipboard
Performance optimization opportunities in common pixel formats.
Prerequisites
- [X] I have written a descriptive issue title
- [X] I have verified that I am running the latest version of ImageSharp
- [X] I have verified if the problem exist in both
DEBUG
andRELEASE
mode - [X] I have searched open and closed issues to ensure it has not already been reported
ImageSharp version
v3 alpha +
Other ImageSharp packages and versions
NA
Environment (Operating system, version and so on)
NA
.NET Framework version
NA
Description
As described here there are several performance opportunities can be implemented in many of our pixel format types. This should be fairly low hanging fruit with good return.
Notably on .NET 6/7, you could make this even more efficient by doing something like:
public void Pack(Vector4 vector)
{
vector *= MaxBytes;
vector += Half;
vector = Vector4.Clamp(vector, Vector4.Zero, MaxBytes);
Vector128<byte> result = Sse2.ConvertToVector128Int32WithTruncation(vector.AsVector128()).AsByte();
// In .NET 7+ the above can be `result = Vector128.ConvertToInt32(vector.AsVector128()).AsByte()` so it works on Arm64 too
R = result.GetElement(0);
G = result.GetElement(4);
B = result.GetElement(8);
A = result.GetElement(12);
}
This converts all 4 elements at once and then extracts the truncated bytes directly:
vzeroupper
vmovupd xmm0, [0x7ffd160105c0]
vmovaps xmm1, xmm0
vmulps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [0x7ffd160105d0]
vaddps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [rdx]
vxorps xmm2, xmm2, xmm2
vmaxps xmm1, xmm1, xmm2
vminps xmm0, xmm1, xmm0
vmovupd [rdx], xmm0
vcvttps2dq xmm0, [rdx]
vpextrb eax, xmm0, 0
mov [rcx+2], al
vpextrb eax, xmm0, 4
mov [rcx+1], al
vpextrb eax, xmm0, 8
mov [rcx], al
vpextrb eax, xmm0, 0xc
mov [rcx+3], al
ret
- We'll also be improving the codegen around
vpextrb
more in the future so it can be justvpextrb [rcx+2], xmm0, 0
instead ofvpextrb eax, xmm0, 0
followed bymov [rcx+2], al
.
You can also optimize in .NET 6+ by directly using Vector128.Create()
. This creates a method local constant and avoids the static initializer entirely:
private static Vector4 MaxBytes => Vector128.Create(255f).AsVector4();
private static Vector4 Half => Vector128.Create(0.5f).AsVector4();
Steps to Reproduce
NA
Images
No response
ImageSharp is .NET 6+ now, correct?
If you want to assign this to me, I can work on getting a fix up and take a peek at some of the other SIMD code for other .NET 6+ specific improvements.
ImageSharp is .NET 6+ now, correct?
If you want to assign this to me, I can work on getting a fix up and take a peek at some of the other SIMD code for other .NET 6+ specific improvements.
@tannergooding yes it's .NET6+ now. I have assigned you to this task, thanks!
@tannergooding Any and all suggestions will be deeply appreciated 😄
Few questions...
- How would you like me to structure the PR(s)?
- What's the consideration for
#ifdef
vs providing helper methods? - What's the recommended way to benchmark and present numbers as part of a PR?
- How much "freedom" is there in changing internal implementation details of structs/methods?
For 1, I'm mostly asking as I can try and do several "smallish" PRs each targeting a specific type or I can do large PRs that try and cover changes more holistically across the repo.
For 2, mostly asking what the preferred way of handling cases such as where Arm64
support might be added. For .NET 6, we have to directly use AdvSimd
. While for .NET 7, you can just use the methods/operators directly on Vector128<float>
and the JIT will do the "optimal" thing for x64 or AdvSimd or WASM or other future platforms.
For 3, naturally I'd like to show numbers actually indicating changes are wins and in a format expected by the repo. I didn't see any callouts in the contributing docs, but maybe I missed it.
For 4, as an example there are a number of structs that contain 2/3/4 float fields (often exposed via auto-props). Due to the JIT specializing Vector2/3/4 but not user-defined structs today (hopefully I can get this fixed eventually, so all qualifying types get this same specialization), you will likely get better codegen by wrapping a single Vector2/3/4
instead.
Additionally there is the potential to wrap Vector128<float>
for "even better" perf. However, the last one comes with a consideration that the type can no longer be used for interop, if that's a consideration at all. -- There is an easy workaround of getting the backing Vector128<float>
field and using .AsVector4()
if interop is a consideration, but it is something to think about.
There's also a slightly more complex consideration in that while Vector2/3/4
are currently accelerated and generally performant, types like Matrix3x2
, Matrix4x4
, and other neighboring types are not (I'm going to try and improve that in .NET 8, but no guarantees). I could write-up more performant helpers and utilize Unsafe.As
to achieve significant perf wins for .NET 6/7 without changing the public surface area from taking the System.Numerics
types.
Also some cases where Vector128<float>
is a better choice than Vector4
in .NET 6, but where they are effectively the same in .NET 7. The .AsVector4()
and .AsVector128()
APIs allow zero-cost conversion in .NET 6+, so using Vector128<float>
in internal methods can provide wins.
Apologies for the slow reply, was offine this weekend.
I'll try to summarise a response to each question for you.
- Since the focus is mostly on pixel formats here, I think we can err towards the larger side. I can see a lot of repitition on the horizon so it should be ok to review.
- I'd like to avoid
#ifdef
where possible since we are only targetting the single framework. If we addArm64
(_something we really need to get a performant build process in place for - QEMU is far too slow!) we can useAdvSimd
and when .NET 8 comes around, we'll switch target to that and incrementally replace the methods to use the new APIs as and when we can. - Benchmarking is a little adhoc. There is a folder in the benchmark project under
General/PixelConversion
where you can place benchmarks in whatever format you see fit in. - You have absolute freedom there, as long as the changes do not make maintaining the codebase more difficult. I'd personally love to put a lot of focus into streamlining and simplifying some of the internal code but there's still some other chunky tasks - #1433 - that need to be done.
Thanks again for your help here!
Thanks a lot for the responses, they all make sense to me!
Just wanted to give an update on this. I've been familiarizing myself with the codebase and the various SIMD code, making notes of possible improvements as I go.
I'll be closing on my first house and moving over the next couple weeks so will endup taking a short break, but hope to have a PR up and a bullet list of potential additional improvements closer to the end of the month.
Congratulations on the house!
No worries, enjoy your break. Looking forward to seeing what you come up with!
-- Am still working on this. Codegen wasn't doing what I expected so I spent some time fixing that up in the JIT (believe you've seen the PR already 😄).
With the changes I've made, the codegen on my local changes is significantly better. So should provide some nice wins off the bat and even better gains once 8.0 comes around.
Yeah, I saw that PR 😃 awesome stuff!
NET 8 seems like it's going to allow massive improvements to the ImageSharp codebase. Simplified SIMD with out of the box ARM support will be a gamechanger.
Also got a rewrite of Matrix4x4 and Matrix3x2 in: https://github.com/dotnet/runtime/pull/80091
This resulted in perf improvements of 2x up to 48x and should have a huge positive impact on ImageSharp. There is also opportunity for me to rewrite/improve Plane and Quaternion, but I didn't see any broad impact in my initial profile captures.
I plan on rerunning ImageSharp perf benchmarks once I have a dotnet/installer
build available, but that represents the largest point of unexpected perf I was seeing. After that, I can finish up on the changes I've been working on here and it should be even better.
Haha.... You're a whole new level of awesome. I was expecting a few small changes; you're rewriting the runtime for the benefit of all. Truly amazing!
I think our ColorMatrix type can benefit from copying those changes.