ImageSharp icon indicating copy to clipboard operation
ImageSharp copied to clipboard

Optimize pixel blending with integer arithmetics

Open antonfirsov opened this issue 5 years ago • 14 comments
trafficstars

Problem

The current float and Vector4 -based pixel blender API does not give us too much space for introducing the rasterization perf improvements needed for SixLabors/ImageSharp.Drawing#102. With our current bulk API, the maximum we can do is to process 2 pixels in one AVX batch, since we can fit only 8 float-s into one AVX register. This means that the expected speedup for blending is around or below 2x. With this we would keep lagging behind Skia and GDI significantly.

Idea

We should explore API-s and implementations working with UInt16-based fixed point arithmetics. This is technically very similar to approach taken by the libjpeg decoder SIMD pipelines which we eventually also want to adapt. In theory, UInt16-based bulk processing should reduce the time spent in pixel blenders by ~4x (or more) when AVX2 is present.

This will require API additions similar to the following:

public abstract class PixelBlender<TPixel> 
{
    public void Blend<TPixelSrc>(
            Configuration configuration,
            Span<TPixel> destination,
            ReadOnlySpan<TPixel> background,
            ReadOnlySpan<TPixelSrc> source,
            // 'amount' is scaled to 0-255. Could be byte, but with UInt16 we will avoid some unnecessary conversions
            ReadOnlySpan<UInt16> amount); 

    protected virtual BlendFunction(
            Configuration configuration,
            Span<Rgba32> destination,
            ReadOnlySpan<Rgba32> background,
            ReadOnlySpan<Rgba32> source,
            ReadOnlySpan<UInt16> amount);
}


public static class PorterDuffFunctions
{
    /*public*/ static Vector4 NormalSrcOver(Span<Rgba32> destination,
            ReadOnlySpan<Rgba32> background,
            ReadOnlySpan<Rgba32> source,
            ReadOnlySpan<UInt16> opacity);
}

Update: In the first API variant there was a type ScaledUInt16Vector4, but after thinking it through, I realized it is unnecessary. We should work with Rgba32 to maximize perf.

antonfirsov avatar Nov 20 '20 18:11 antonfirsov

Agreed. This seems to be the smartest approach while keeping work on the CPU. Other libraries defer to the GPU for perf

JimBobSquarePants avatar Nov 21 '20 10:11 JimBobSquarePants

Pinning this as It's super important for ImageSharp.Drawing

JimBobSquarePants avatar Sep 09 '21 14:09 JimBobSquarePants

@br3aker I know it's not jpeg but is this something you'd be interested in? Seems like something you could churn out whereas it would take me ages.

JimBobSquarePants avatar Sep 29 '21 13:09 JimBobSquarePants

This is important in long term, but improving Jpeg decoder performance has higher priority at the moment IMO, so if @br3aker is more interested in Jpeg he's doing the right thing :)

We need to close the 40% gap to vips and MagicScaler in thumbnail making. (See #1775).

antonfirsov avatar Oct 03 '21 12:10 antonfirsov

Yep. Let's get that gap closed! We still need someone to look at this so hoping someone from the community can step up while I focus on the Fonts/Drawing libraries themselves.

JimBobSquarePants avatar Oct 03 '21 13:10 JimBobSquarePants

Oops, totally missed this conversation, I'm sorry.

We can work something out after jpeg decoder :)

br3aker avatar Oct 05 '21 00:10 br3aker

I think this is our only outstanding feature that is blocking V2 now. I don't think we can release ImageSharp.Drawing without it.

JimBobSquarePants avatar Jan 18 '22 00:01 JimBobSquarePants

This is huge work. I mean, it can literally take engineer-weeks, and there is no guarantee for massive improvements from integer arithmetics, we can get surprised just like it happened with #1943. Actually, I would try to create an SSE implementation utilizing float instruction tricks as the first iteration, that could already bring noticeable improvements.

The API addition I'm proposing is not breaking, so I don't see why would we block the core library's V2 release on this. Even if ImageSharp.Drawing V1 goes out with a decent API and stable functionality, but without this optimization I think it would be still better than keeping it beta forever.

antonfirsov avatar Jan 18 '22 01:01 antonfirsov

OK @antonfirsov taking your lead here then.

If we choose to defer we should get the last few Issues/PRs completed and add nothing new to V2 unless it's a critical bug. I'd like to get it shipped asap.

JimBobSquarePants avatar Jan 18 '22 04:01 JimBobSquarePants

Jpeg took far more time than I've expected, at least we now know that fixed point thingy may not be a good solution.

br3aker avatar Jan 18 '22 10:01 br3aker

Jpeg took far more time than I've expected But look how fast you've made it!

JimBobSquarePants avatar Jan 18 '22 11:01 JimBobSquarePants

Much is yet to come!

br3aker avatar Jan 18 '22 12:01 br3aker

I might be missing something but I'm a little concerned about the Rgba32 requirement here. We're going to need 16bit precision per component.

JimBobSquarePants avatar Jun 24 '22 03:06 JimBobSquarePants

Yeah, internally we would have to convert the components to 16bit just like most jpeg calculations do. This means that it might make sense to expose an Rgba64 API and do the widening/narrowing on the call site, or we can even consider making the API generic and enabling individual porter duff implementations through PixelBlender<T>, or a parallel type.

We will know better when we'll have a prototype.

antonfirsov avatar Jun 24 '22 12:06 antonfirsov

or we can even consider making the API generic and enabling individual porter duff implementations through PixelBlender<T>, or a parallel type.

Not sure I'm following here. We can already choose the operation via a combination of enums.

JimBobSquarePants avatar Dec 13 '22 08:12 JimBobSquarePants

I don't remember what I meant exactly, but most likely I was recommending to enable BlendFunction to work with the TPixel and TPixelSrc spans directly instead of forcing it to use a particular element type.

antonfirsov avatar Dec 13 '22 15:12 antonfirsov

It occurs to me that I don't actually understand our implementations at all. I had a stab at an Avx2 implementation of NormalSrcOver in order to run a benchmark as I suspect our current implementations are not as good as they could be, and I couldn't figure out what current implementation was doing compared to the w3c docs.

UPDATE.

Managed to put together an AVX2 version of Over using Vector256<float>. A fraction over 2x faster than our current implementation.

JimBobSquarePants avatar Jan 15 '23 13:01 JimBobSquarePants

@antonfirsov I've been having a good think about this and think the milestone should be moved to Future/v4

In order to implement this, we essentially have to reimplement the patterns we use for PixelOperations which as you say is a massive job.

I actually believe we can develop a better pattern using static virtual and abstract methods on interfaces with .NE7+

I can create a new issue to implement AVX2 additions to the Porter-Duff operations for v3. What do you think?

JimBobSquarePants avatar Jan 26 '23 01:01 JimBobSquarePants

The reason why we are chasing this is to make Drawing fast. It's actually more important to get it done first :) I don't think integer blending is critical for V3, having floating-point AVX2 would be already a nice improvement!

antonfirsov avatar Jan 26 '23 02:01 antonfirsov

Awesome. We should be able to incrementally implement versions of all the separable methods. I'll spin up an issue.

JimBobSquarePants avatar Jan 26 '23 03:01 JimBobSquarePants

Worth noting that while some operations are just as fast for integral. Others, like multiplication, can be ~2x slower and so the exact perf benefit may then vary a bit depending exactly on what's needed for the blend operation. -- SIMD float multiplication is ~4 cycles, SIMD int32 multiplication is ~10 cycles, SIMD int16 multiplication is ~5 cycles, but only does the lower or upper elements so you need 2 instructions

tannergooding avatar Feb 16 '23 22:02 tannergooding

I would add to that, FMA changes the equation as well, because in olden times you were looking at 4 cycles for float mul + 4 cycles for float add on the low end, compared to a best case of 5 + 1 for int16. With FMA, obviously we get that float add for free. A lot of the existing code in the wild was written pre-FMA, when the integer math hoops were more worth jumping through.

saucecontrol avatar Feb 16 '23 22:02 saucecontrol

So are we saying here that it's actually best to stick with Avx and Vector256<float>?

JimBobSquarePants avatar Feb 17 '23 00:02 JimBobSquarePants

Yeah, naively I thought that we can squeeze out much more from integers, but feels like #2340 is more worthy to chase now, there is good potential to reach better than 2x speedup with FMA.

antonfirsov avatar Feb 17 '23 00:02 antonfirsov

I guess that's good news then as it's a LOT less work.

I'll need a lot of help to do this though, I can do bit and pieces but will struggle with certain parts.

JimBobSquarePants avatar Feb 17 '23 00:02 JimBobSquarePants

So are we saying here that it's actually best to stick with Avx and Vector256<float>?

I hate to drop an "it depends" on you but...

I'd say the general rule of thumb is that the more complex the math, the more likely it is Vector256<float> will work out better. Balance that with the fact that if you're starting and ending with integer data, you pay an extra 4+4 cycles for the int->float->int in addition to only be able to fit half as many pixels in each vector. Obviously if you go float in one part, it's usually better to stay there, pipelining steps if possible.

On the other hand, integer domain makes it easier to incorporate bit twiddling hacks, plus there are some integer-only instructions that do crazy amounts of work in ~5 cycles, like our old pals mpsadbw and pmaddubsw. If you can take advantage of those tricks, obviously it could sway performance heavily in favor of integer math.

The porter-duff ops are simple enough they will probably be faster in integer domain, but taking into account the FMA factor, that win is probably not going to be big enough to justify added complexity if you're more comfortable in float land. You can always re-visit when you target a newer runtime with the simplified xplat intrinsics.

saucecontrol avatar Feb 17 '23 03:02 saucecontrol

So an integer based version of PorterDuffFunctions would be something I'd use for Rgba32 compatible pixel formats. We can use the existing bulk shuffle operations to translate them for working against and call them using overloads of PixelOperations.GetPixelBlender.

Unfortunately, implementing those would be outside my ability to deliver at present.

JimBobSquarePants avatar Feb 17 '23 09:02 JimBobSquarePants

I don't think we need this anymore. The new blender is very fast.

JimBobSquarePants avatar Mar 17 '23 03:03 JimBobSquarePants