30% performance increase

Open crener opened this issue 3 years ago • 2 comments

Hey, I saw there was a .NET port of QOI and got curious about the newer performance features available in .NET Standard. Here is what I came up with:

Encode:

| Method | CurrentPath | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| OriginalEncode | Large.jpg | 201,046.6 us | 722.04 us | 675.39 us | 1.00 | - | - | - | 74,109 KB |
| OptimisedEncode | Large.jpg | 139,210.1 us | 332.80 us | 295.02 us | 0.69 | - | - | - | 18,885 KB |
| OriginalEncode | Medium.jpg | 11,649.1 us | 47.91 us | 42.47 us | 1.00 | 31.2500 | 31.2500 | 31.2500 | 4,385 KB |
| OptimisedEncode | Medium.jpg | 7,930.6 us | 73.48 us | 65.14 us | 0.68 | - | - | - | 1,749 KB |
| OriginalEncode | Small.jpg | 1,054.2 us | 1.55 us | 1.29 us | 1.00 | 1.9531 | 1.9531 | 1.9531 | 422 KB |
| OptimisedEncode | Small.jpg | 724.7 us | 3.00 us | 2.81 us | 0.69 | - | - | - | 129 KB |

Decode:

| Method | CurrentPath | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| OriginalDecode | Large.qoi | 127,637.5 us | 576.69 us | 539.43 us | 1.00 | 250.0000 | 250.0000 | 250.0000 | 55,285 KB |
| OptimisedDecode | Large.qoi | 123,073.7 us | 222.35 us | 197.11 us | 0.96 | 250.0000 | 250.0000 | 250.0000 | 55,284 KB |
| OriginalDecode | Medium.qoi | 6,381.4 us | 26.74 us | 22.33 us | 1.00 | 46.8750 | 46.8750 | 46.8750 | 2,637 KB |
| OptimisedDecode | Medium.qoi | 5,763.2 us | 41.81 us | 39.11 us | 0.90 | 70.3125 | 70.3125 | 70.3125 | 2,637 KB |
| OriginalDecode | Small.qoi | 636.7 us | 1.35 us | 1.19 us | 1.00 | 5.8594 | 5.8594 | 5.8594 | 293 KB |
| OptimisedDecode | Small.qoi | 630.1 us | 9.91 us | 9.27 us | 0.99 | 7.8125 | 7.8125 | 7.8125 | 293 KB |

I think potential users would quite like these changes, as it's a nice speedup overall: encoding takes roughly 31% less time and allocates far less memory, while decoding improves by 1-10%.

Main changes

  • Minimize GC pressure by renting temporary arrays from ArrayPool when encoding/decoding, so that no scratch buffer needs to be allocated and thrown away.
  • Use Span for read-only array access, since it has lower overhead (funnily enough, not every place that used an array benefited from Span).
  • Some aggressive inlining means that small methods don't add a new stack frame, which improves performance at the expense of assembly size (but the number of call sites in this case doesn't make a huge difference to size).
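
For anyone who hasn't used these three features together, here is a rough sketch of the pattern, not the code from the PR. `PooledBufferSketch`, `HashPixels` and `ColorHash` are made-up names for illustration; only the hash formula comes from the QOI spec. The rented array stands in for the temporary worst-case-size buffer an encoder would otherwise allocate on every call.

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;

public static class PooledBufferSketch
{
    // QOI index hash from the spec; small enough that inlining avoids a call per pixel.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int ColorHash(byte r, byte g, byte b, byte a) =>
        (r * 3 + g * 5 + b * 7 + a * 11) % 64;

    // Hypothetical helper: writes one hash byte per RGBA pixel into a rented
    // scratch buffer, then copies only the used part into the result.
    public static byte[] HashPixels(ReadOnlySpan<byte> rgba)
    {
        int pixelCount = rgba.Length / 4;

        // Rent may hand back a larger array than requested, so slice it.
        byte[] rented = ArrayPool<byte>.Shared.Rent(pixelCount);
        try
        {
            Span<byte> scratch = rented.AsSpan(0, pixelCount);
            for (int i = 0; i < pixelCount; i++)
            {
                int p = i * 4;
                scratch[i] = (byte)ColorHash(rgba[p], rgba[p + 1], rgba[p + 2], rgba[p + 3]);
            }

            // The only allocation that survives the call is the exact-size result.
            return scratch.ToArray();
        }
        finally
        {
            // Always hand the buffer back, even if the loop throws.
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

The `try`/`finally` matters: a rented array that is never returned is not a leak in the strict sense, but it defeats the point of pooling.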

Minor changes

  • Split the RGBA and RGB code paths so that there are no redundant checks inside the loop, which reduces branching (and cuts down on some CPU work). This gives about a 5-7% improvement, but it massively increases the size of the code and duplicates some of it, so I can understand not wanting this part of the change.
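
To show what the split looks like in miniature, here is a sketch under my own assumptions rather than the actual encoder loop. `ChannelSplitSketch` and its methods are hypothetical; the work done per pixel (counting pixels equal to the previous one) is just a stand-in for the real encode/decode body.

```csharp
using System;

public static class ChannelSplitSketch
{
    // Generic version: the channel count is re-checked for every pixel.
    public static int CountRepeats(ReadOnlySpan<byte> pixels, int channels)
    {
        int repeats = 0;
        for (int i = channels; i + channels <= pixels.Length; i += channels)
        {
            bool same = pixels[i] == pixels[i - channels]
                     && pixels[i + 1] == pixels[i + 1 - channels]
                     && pixels[i + 2] == pixels[i + 2 - channels];
            if (channels == 4) // redundant per-pixel branch
                same &= pixels[i + 3] == pixels[i + 3 - channels];
            if (same) repeats++;
        }
        return repeats;
    }

    // Split version: branch once up front, then run a loop with a constant stride.
    public static int CountRepeatsSplit(ReadOnlySpan<byte> pixels, int channels) =>
        channels == 4 ? CountRepeats4(pixels) : CountRepeats3(pixels);

    private static int CountRepeats3(ReadOnlySpan<byte> pixels)
    {
        int repeats = 0;
        for (int i = 3; i + 3 <= pixels.Length; i += 3)
        {
            if (pixels[i] == pixels[i - 3]
                && pixels[i + 1] == pixels[i - 2]
                && pixels[i + 2] == pixels[i - 1])
                repeats++;
        }
        return repeats;
    }

    private static int CountRepeats4(ReadOnlySpan<byte> pixels)
    {
        int repeats = 0;
        for (int i = 4; i + 4 <= pixels.Length; i += 4)
        {
            if (pixels[i] == pixels[i - 4]
                && pixels[i + 1] == pixels[i - 3]
                && pixels[i + 2] == pixels[i - 2]
                && pixels[i + 3] == pixels[i - 1])
                repeats++;
        }
        return repeats;
    }
}
```

The duplication is visible even at this size: the two private loops differ only in stride and in one comparison, which is the maintenance cost being traded for the speedup.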

crener, Dec 28 '21 04:12

Using the two benchmarks in the PR I submitted ( https://github.com/NUlliiON/QoiSharp/pull/8 ):

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|
| QoiEncoding | 5.887 ms | 0.0168 ms | 0.0149 ms | 101.5625 | 101.5625 | 101.5625 | 729 KB |

and

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|
| QoiDecoding | 4.049 ms | 0.0175 ms | 0.0155 ms | 203.1250 | 203.1250 | 203.1250 | 2 MB |

The WIP perf improvements in my various PRs were combined into this: https://github.com/NUlliiON/QoiSharp/pull/7#issuecomment-1002782226 and offer faster encoding and decoding as compared to this PR (for the specific image in the benchmark).

There's definitely some stuff worth integrating from both your PR and my PR. I was hesitant about splitting into a Decode3 and a Decode4 method. However, if the maintainer is comfortable with that, we could investigate iterating over the byte[] using MemoryMarshal.Cast<byte, RGB> or MemoryMarshal.Cast<byte, RGBA> (where the structs are: struct RGB { public byte r, g, b; } and struct RGBA { public byte r, g, b, a; }). There are some other tricks that could be investigated too :)
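
A minimal sketch of that reinterpreting cast, assuming it is done with `MemoryMarshal.Cast` from `System.Runtime.InteropServices`. The struct definitions follow the ones above; the explicit `StructLayout` attribute, `PixelCastSketch` and the `SumRed*` methods are my own additions for illustration and are not from either PR.

```csharp
using System;
using System.Runtime.InteropServices;

// Pack = 1 makes the intended 3- and 4-byte sizes explicit.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct RGB { public byte r, g, b; }

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct RGBA { public byte r, g, b, a; }

public static class PixelCastSketch
{
    // Views the raw bytes as 3-byte pixels without copying.
    public static int SumRed3(ReadOnlySpan<byte> data)
    {
        ReadOnlySpan<RGB> pixels = MemoryMarshal.Cast<byte, RGB>(data);
        int sum = 0;
        foreach (RGB p in pixels)
            sum += p.r;
        return sum;
    }

    // Same idea for 4-byte pixels.
    public static int SumRed4(ReadOnlySpan<byte> data)
    {
        ReadOnlySpan<RGBA> pixels = MemoryMarshal.Cast<byte, RGBA>(data);
        int sum = 0;
        foreach (RGBA p in pixels)
            sum += p.r;
        return sum;
    }
}
```

Trailing bytes that do not fill a whole struct are left out of the cast span, so a caller would still want to validate the input length against width * height * channels beforehand.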

alanmcgovern, Dec 29 '21 21:12

Yeah, the RGB split is a little extreme, but when I did it I got a good 10% improvement in decode, which made me do it in encode too, and that was 5%.

crener, Dec 30 '21 12:12