
Speed Up Jpeg Encoder Color Conversion

Open JimBobSquarePants opened this issue 3 years ago • 13 comments

An analysis of encoder performance, based on a breakdown of this benchmark, indicates that encoding a large JPEG takes 80% of the total processing time.

https://github.com/kleisauke/net-vips/tree/master/tests/NetVips.Benchmarks

This is due to the lack of hardware acceleration in our color conversion approach.

The current Jpeg encoder utilizes predefined tables to convert a span of Rgb24 pixels into separate Y Cb Cr Block8x8F planes.

https://github.com/SixLabors/ImageSharp/blob/f1a0fb61798e718314b26af37d4470e7ec381793/src/ImageSharp/Formats/Jpeg/Components/Encoder/YCbCrForwardConverter%7BTPixel%7D.cs#L58-L83

While this is faster than naïve per-pixel floating-point calculation, it can be heavily optimized.
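To make the comparison concrete, here is a minimal sketch of the two scalar approaches: the per-pixel floating-point formula and a table-driven fixed-point variant. It assumes the standard JFIF full-range YCbCr coefficients and the libjpeg-style table layout; it is not a port of the ImageSharp converter, and every name in it is hypothetical.

```python
# Sketch only: JFIF-style RGB -> YCbCr for 8-bit samples.
# Assumption: standard JFIF coefficients, libjpeg-style fixed-point tables.
# All names are hypothetical and are not taken from ImageSharp.

SCALE_BITS = 16                      # number of fractional bits
HALF = 1 << (SCALE_BITS - 1)         # rounding term
CHROMA_OFFSET = 128 << SCALE_BITS    # centers Cb/Cr on 128


def fix(x):
    """Convert a positive float coefficient to fixed point."""
    return int(x * (1 << SCALE_BITS) + 0.5)


def rgb_to_ycbcr_float(r, g, b):
    """Naive per-pixel floating-point conversion (unrounded results)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return y, cb, cr


# One precomputed table per (input channel, output channel) pair, so the
# per-pixel work is three lookups, a few adds and one shift.
Y_R = [fix(0.299) * i for i in range(256)]
Y_G = [fix(0.587) * i for i in range(256)]
Y_B = [fix(0.114) * i for i in range(256)]
CB_R = [-fix(0.168736) * i for i in range(256)]
CB_G = [-fix(0.331264) * i for i in range(256)]
CB_B = [fix(0.5) * i for i in range(256)]
CR_R = CB_B                          # same 0.5 coefficient
CR_G = [-fix(0.418688) * i for i in range(256)]
CR_B = [-fix(0.081312) * i for i in range(256)]


def rgb_to_ycbcr_table(r, g, b):
    """Table-driven fixed-point conversion returning integers in 0..255."""
    y = (Y_R[r] + Y_G[g] + Y_B[b] + HALF) >> SCALE_BITS
    # HALF - 1 keeps the 255.5 edge case from rounding up to 256.
    cb = (CB_R[r] + CB_G[g] + CB_B[b] + CHROMA_OFFSET + HALF - 1) >> SCALE_BITS
    cr = (CR_R[r] + CR_G[g] + CR_B[b] + CHROMA_OFFSET + HALF - 1) >> SCALE_BITS
    return y, cb, cr


if __name__ == "__main__":
    for pixel in [(255, 255, 255), (0, 0, 0), (255, 0, 0), (12, 200, 99)]:
        print(pixel, rgb_to_ycbcr_table(*pixel), rgb_to_ycbcr_float(*pixel))
```

Both versions still handle one pixel at a time. A SIMD version would apply the same multiply-add arithmetic to several pixels per instruction, which is the kind of speedup the short-term goal below is after.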

Short Term Goal

Add AVX2? acceleration directly to the converter to optimize conversion for .NET Core 3.1+. This should be a few hours' work for someone with SIMD knowledge.

Long Term Goal

Establish an architecture similar to the Jpeg decoder ColorConverters, allowing incremental addition of accelerated converters for all platforms and color spaces.

JimBobSquarePants avatar Dec 15 '20 11:12 JimBobSquarePants

DirectX Math is MIT licensed and already provides SIMD accelerated algorithms (for x86/x64 and ARM64) for many standard color conversions: https://docs.microsoft.com/en-us/windows/win32/api/directxmath/nf-directxmath-xmcolorrgbtoyuv

https://github.com/microsoft/DirectXMath/blob/master/Inc/DirectXMathMisc.inl#L1728-L1738

It's documented to use ITU-R BT.601/CCIR 601 (W(r) = 0.299, W(b) = 0.114, U(max) = 0.436, V(max) = 0.615), which I believe is what you want.

There is likewise https://docs.microsoft.com/en-us/windows/win32/api/directxmath/nf-directxmath-xmcolorrgbtoyuv_hd which uses ITU-R BT.709 (W(r) = 0.2126, W(b) = 0.0722, U(max) = 0.436, V(max) = 0.615).
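For reference, these four constants fully determine the conversion. The block below is the standard textbook definition of YUV in terms of those weights, written out as a sketch; it is not copied from the DirectXMath source.

```latex
\begin{aligned}
W_g &= 1 - W_r - W_b \\
Y   &= W_r R + W_g G + W_b B \\
U   &= U_{\max} \, \frac{B - Y}{1 - W_b} \\
V   &= V_{\max} \, \frac{R - Y}{1 - W_r}
\end{aligned}
```

This gives W(g) = 0.587 for BT.601 and W(g) = 0.7152 for BT.709. Note that JFIF-style YCbCr uses the BT.601 luma weights but scales both chroma channels to a maximum of 0.5 and adds an offset of 128 for 8-bit samples, so the chroma scaling would need adjusting before either function's output could be used in a JPEG encoder.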

tannergooding avatar Dec 15 '20 17:12 tannergooding

Thanks @tannergooding I'll see how well the code there fits with our existing architecture.

JimBobSquarePants avatar Dec 16 '20 00:12 JimBobSquarePants

Note that the referenced netvips benchmark is quite atypical for Resize. Users usually downscale much more than 90%, so I wouldn't worry that much about the encoder being a bottleneck.

We should profile this though; I wonder how exactly the bottlenecks are distributed.

antonfirsov avatar Jan 18 '21 11:01 antonfirsov

@Sergio0694 was doing some work with 4K images the other day and the benchmarks he showed me indicated that the encoder was a major bottleneck. Not typical, but also not that uncommon.

I’ll try and dig out the screenshot.

JimBobSquarePants avatar Jan 18 '21 18:01 JimBobSquarePants

[Image: jpeg-encoder-bench]

JimBobSquarePants avatar Jan 18 '21 23:01 JimBobSquarePants

@tkp1n has provided us with some great improvements via #1508, and I'll profile an encode to see where the remaining bottlenecks are.

JimBobSquarePants avatar Jan 18 '21 23:01 JimBobSquarePants

Ah yeah, saving a 4K JPEG must be very slow indeed. (Hopefully it's noticeably better with #1508.)

What I mean is that we got away with a slow encoder for this long because typical thumbnail-maker code usually saves a very small output image, so it's probably not that hot a path for web content management. (That doesn't mean it's not painful in other use cases.)

antonfirsov avatar Jan 19 '21 16:01 antonfirsov

Yeah I was very surprised to see just how much slower ImageSharp was at JPEG encoding/decoding 😥

I was expecting it to be somewhat on par, but the encoding part especially is a lot, a lot slower. In my case, just saving the image takes longer than copying it to the GPU, processing it and copying it back put together. In fact, it takes 4x as long as all those steps combined, and I haven't even optimized them that much either. I was kinda tempted to switch my samples to System.Drawing, though in the end I didn't because, well, I love you guys, and also the API surface of System.Drawing is ugly 😄

Point is, any speed improvements in this area would be super welcome, especially if you're concerned about people running comparative benchmarks between ImageSharp and other common image processing libraries. On that note, I'll also run some tests on a few improvements I've been meaning to add to the resize kernel using FMA instructions 🚀

Sergio0694 avatar Jan 19 '21 16:01 Sergio0694

I've attached a speedscope dump from PerfView, as asked for in https://github.com/SixLabors/ImageSharp/pull/1517#issuecomment-764804093. The trace is from a BenchmarkDotNet benchmark of a 4K JPEG export after the optimization in #1517.

Unzip it, open it in https://www.speedscope.app/, select "Left heavy" (top menu), and scroll all the way down.

JPEG_encode.speedscope.zip

tkp1n avatar Jan 21 '21 19:01 tkp1n

I'm not a pro with this tool, but if I'm reading it right, the RowOctet constructor and Emit are the new bottlenecks. @tkp1n can you confirm?

Of these, only the RowOctet constructor is related to color conversion, and it can be fixed by addressing the following TODO note: https://github.com/SixLabors/ImageSharp/blob/0e0dc2ae9cafcdf5bde9d185919cf073ddf4f186/src/ImageSharp/Formats/Jpeg/JpegEncoderCore.cs#L1011-L1012

antonfirsov avatar Jan 21 '21 19:01 antonfirsov

can you confirm?

Exactly, yes.

tkp1n avatar Jan 21 '21 19:01 tkp1n

@JimBobSquarePants I think we can close this in favor of a general JpegEncoder perf tracking issue.

antonfirsov avatar Jan 21 '21 19:01 antonfirsov

Ok. Let’s migrate all the relevant info.

JimBobSquarePants avatar Jan 21 '21 19:01 JimBobSquarePants

Fixed via #2120

JimBobSquarePants avatar Aug 11 '22 13:08 JimBobSquarePants