ImageSharp Performance optimization opportunities in common pixel formats.

Prerequisites

[X] I have written a descriptive issue title
[X] I have verified that I am running the latest version of ImageSharp
[X] I have verified if the problem exist in both DEBUG and RELEASE mode
[X] I have searched open and closed issues to ensure it has not already been reported

ImageSharp version

v3 alpha +

Other ImageSharp packages and versions

NA

Environment (Operating system, version and so on)

NA

.NET Framework version

NA

Description

As described here there are several performance opportunities can be implemented in many of our pixel format types. This should be fairly low hanging fruit with good return.

Notably on .NET 6/7, you could make this even more efficient by doing something like:

public void Pack(Vector4 vector)
{
    vector *= MaxBytes;
    vector += Half;
    vector = Vector4.Clamp(vector, Vector4.Zero, MaxBytes);

    Vector128<byte> result = Sse2.ConvertToVector128Int32WithTruncation(vector.AsVector128()).AsByte();
    // In .NET 7+ the above can be `result = Vector128.ConvertToInt32(vector.AsVector128()).AsByte()` so it works on Arm64 too

    R = result.GetElement(0);
    G = result.GetElement(4);
    B = result.GetElement(8);
    A = result.GetElement(12);
}

This converts all 4 elements at once and then extracts the truncated bytes directly:

vzeroupper
vmovupd xmm0, [0x7ffd160105c0]
vmovaps xmm1, xmm0
vmulps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [0x7ffd160105d0]
vaddps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [rdx]
vxorps xmm2, xmm2, xmm2
vmaxps xmm1, xmm1, xmm2
vminps xmm0, xmm1, xmm0
vmovupd [rdx], xmm0
vcvttps2dq xmm0, [rdx]
vpextrb eax, xmm0, 0
mov [rcx+2], al
vpextrb eax, xmm0, 4
mov [rcx+1], al
vpextrb eax, xmm0, 8
mov [rcx], al
vpextrb eax, xmm0, 0xc
mov [rcx+3], al
ret

We'll also be improving the codegen around vpextrb more in the future so it can be just vpextrb [rcx+2], xmm0, 0 instead of vpextrb eax, xmm0, 0 followed by mov [rcx+2], al.

You can also optimize in .NET 6+ by directly using Vector128.Create(). This creates a method local constant and avoids the static initializer entirely:

    private static Vector4 MaxBytes => Vector128.Create(255f).AsVector4();
    private static Vector4 Half => Vector128.Create(0.5f).AsVector4();

Steps to Reproduce

NA

Images

No response

Sep 16 '22 05:09 JimBobSquarePants

ImageSharp is .NET 6+ now, correct?

If you want to assign this to me, I can work on getting a fix up and take a peek at some of the other SIMD code for other .NET 6+ specific improvements.

Sep 16 '22 16:09 tannergooding

ImageSharp is .NET 6+ now, correct?

If you want to assign this to me, I can work on getting a fix up and take a peek at some of the other SIMD code for other .NET 6+ specific improvements.

@tannergooding yes it's .NET6+ now. I have assigned you to this task, thanks!

Sep 16 '22 20:09 brianpopow

@tannergooding Any and all suggestions will be deeply appreciated 😄

Sep 16 '22 20:09 JimBobSquarePants

Few questions...

How would you like me to structure the PR(s)?
What's the consideration for #ifdef vs providing helper methods?
What's the recommended way to benchmark and present numbers as part of a PR?
How much "freedom" is there in changing internal implementation details of structs/methods?

For 1, I'm mostly asking as I can try and do several "smallish" PRs each targeting a specific type or I can do large PRs that try and cover changes more holistically across the repo.

For 2, mostly asking what the preferred way of handling cases such as where Arm64 support might be added. For .NET 6, we have to directly use AdvSimd. While for .NET 7, you can just use the methods/operators directly on Vector128<float> and the JIT will do the "optimal" thing for x64 or AdvSimd or WASM or other future platforms.

For 3, naturally I'd like to show numbers actually indicating changes are wins and in a format expected by the repo. I didn't see any callouts in the contributing docs, but maybe I missed it.

For 4, as an example there are a number of structs that contain 2/3/4 float fields (often exposed via auto-props). Due to the JIT specializing Vector2/3/4 but not user-defined structs today (hopefully I can get this fixed eventually, so all qualifying types get this same specialization), you will likely get better codegen by wrapping a single Vector2/3/4 instead.

Additionally there is the potential to wrap Vector128<float> for "even better" perf. However, the last one comes with a consideration that the type can no longer be used for interop, if that's a consideration at all. -- There is an easy workaround of getting the backing Vector128<float> field and using .AsVector4() if interop is a consideration, but it is something to think about.

There's also a slightly more complex consideration in that while Vector2/3/4 are currently accelerated and generally performant, types like Matrix3x2, Matrix4x4, and other neighboring types are not (I'm going to try and improve that in .NET 8, but no guarantees). I could write-up more performant helpers and utilize Unsafe.As to achieve significant perf wins for .NET 6/7 without changing the public surface area from taking the System.Numerics types.

Also some cases where Vector128<float> is a better choice than Vector4 in .NET 6, but where they are effectively the same in .NET 7. The .AsVector4() and .AsVector128() APIs allow zero-cost conversion in .NET 6+, so using Vector128<float> in internal methods can provide wins.

Sep 17 '22 03:09 tannergooding

Apologies for the slow reply, was offine this weekend.

I'll try to summarise a response to each question for you.

Since the focus is mostly on pixel formats here, I think we can err towards the larger side. I can see a lot of repitition on the horizon so it should be ok to review.
I'd like to avoid #ifdef where possible since we are only targetting the single framework. If we add Arm64 (_something we really need to get a performant build process in place for - QEMU is far too slow!) we can use AdvSimd and when .NET 8 comes around, we'll switch target to that and incrementally replace the methods to use the new APIs as and when we can.
Benchmarking is a little adhoc. There is a folder in the benchmark project under General/PixelConversion where you can place benchmarks in whatever format you see fit in.
You have absolute freedom there, as long as the changes do not make maintaining the codebase more difficult. I'd personally love to put a lot of focus into streamlining and simplifying some of the internal code but there's still some other chunky tasks - #1433 - that need to be done.

Thanks again for your help here!

Sep 18 '22 22:09 JimBobSquarePants

Thanks a lot for the responses, they all make sense to me!

Sep 18 '22 23:09 tannergooding

Just wanted to give an update on this. I've been familiarizing myself with the codebase and the various SIMD code, making notes of possible improvements as I go.

I'll be closing on my first house and moving over the next couple weeks so will endup taking a short break, but hope to have a PR up and a bullet list of potential additional improvements closer to the end of the month.

Oct 04 '22 21:10 tannergooding

Congratulations on the house!

No worries, enjoy your break. Looking forward to seeing what you come up with!

Oct 04 '22 22:10 JimBobSquarePants

-- Am still working on this. Codegen wasn't doing what I expected so I spent some time fixing that up in the JIT (believe you've seen the PR already 😄).

With the changes I've made, the codegen on my local changes is significantly better. So should provide some nice wins off the bat and even better gains once 8.0 comes around.

Dec 16 '22 00:12 tannergooding

Yeah, I saw that PR 😃 awesome stuff!

NET 8 seems like it's going to allow massive improvements to the ImageSharp codebase. Simplified SIMD with out of the box ARM support will be a gamechanger.

Dec 17 '22 02:12 JimBobSquarePants

Also got a rewrite of Matrix4x4 and Matrix3x2 in: https://github.com/dotnet/runtime/pull/80091

This resulted in perf improvements of 2x up to 48x and should have a huge positive impact on ImageSharp. There is also opportunity for me to rewrite/improve Plane and Quaternion, but I didn't see any broad impact in my initial profile captures.

I plan on rerunning ImageSharp perf benchmarks once I have a dotnet/installer build available, but that represents the largest point of unexpected perf I was seeing. After that, I can finish up on the changes I've been working on here and it should be even better.

Jan 10 '23 18:01 tannergooding

Haha.... You're a whole new level of awesome. I was expecting a few small changes; you're rewriting the runtime for the benefit of all. Truly amazing!

Jan 11 '23 06:01 JimBobSquarePants

I think our ColorMatrix type can benefit from copying those changes.

Jan 15 '23 08:01 JimBobSquarePants

ImageSharp ImageSharp copied to clipboard

Performance optimization opportunities in common pixel formats.

Prerequisites

ImageSharp version

Other ImageSharp packages and versions

Environment (Operating system, version and so on)

.NET Framework version

Description

Steps to Reproduce

Images

ImageSharp
ImageSharp copied to clipboard