QoiSharp 30% performance increase

Hey, I saw there was a .NET port of QOI and got curious about the newer performance features available in .NET Standard. Here is what I came up with:

Encode:

Method	CurrentPath	Mean	Error	StdDev	Ratio	Gen 0	Gen 1	Gen 2	Allocated
OriginalEncode	Large.jpg	201,046.6 us	722.04 us	675.39 us	1.00	-	-	-	74,109 KB
OptimisedEncode	Large.jpg	139,210.1 us	332.80 us	295.02 us	0.69	-	-	-	18,885 KB

OriginalEncode	Medium.jpg	11,649.1 us	47.91 us	42.47 us	1.00	31.2500	31.2500	31.2500	4,385 KB
OptimisedEncode	Medium.jpg	7,930.6 us	73.48 us	65.14 us	0.68	-	-	-	1,749 KB

OriginalEncode	Small.jpg	1,054.2 us	1.55 us	1.29 us	1.00	1.9531	1.9531	1.9531	422 KB
OptimisedEncode	Small.jpg	724.7 us	3.00 us	2.81 us	0.69	-	-	-	129 KB

Decode:

Method	CurrentPath	Mean	Error	StdDev	Ratio	Gen 0	Gen 1	Gen 2	Allocated
OriginalDecode	Large.qoi	127,637.5 us	576.69 us	539.43 us	1.00	250.0000	250.0000	250.0000	55,285 KB
OptimisedDecode	Large.qoi	123,073.7 us	222.35 us	197.11 us	0.96	250.0000	250.0000	250.0000	55,284 KB

OriginalDecode	Medium.qoi	6,381.4 us	26.74 us	22.33 us	1.00	46.8750	46.8750	46.8750	2,637 KB
OptimisedDecode	Medium.qoi	5,763.2 us	41.81 us	39.11 us	0.90	70.3125	70.3125	70.3125	2,637 KB

OriginalDecode	Small.qoi	636.7 us	1.35 us	1.19 us	1.00	5.8594	5.8594	5.8594	293 KB
OptimisedDecode	Small.qoi	630.1 us	9.91 us	9.27 us	0.99	7.8125	7.8125	7.8125	293 KB

I think potential users would quite like these changes: encoding is roughly 30% faster and allocates far less memory, while decoding is between 1% and 10% faster depending on the image.

Main changes

Minimize GC pressure by renting temporary arrays from ArrayPool when encoding/decoding, so that scratch buffers do not have to be allocated and thrown away on every call.
Use Span for read-only array access, as it has lower overhead (funnily enough, not every place that used an array benefited from Span).
Apply aggressive inlining to small methods so that calling them does not add a new stack frame. This improves performance at the expense of assembly size, but with the number of call sites in this case it makes no significant difference to size.
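To make the three main changes concrete, here is a minimal sketch rather than the actual code from the PR. `PoolingSketch`, `Process` and `Average` are hypothetical names made up for illustration; only `ArrayPool<byte>.Shared`, `Span<T>`/`ReadOnlySpan<T>` and `MethodImplOptions.AggressiveInlining` are real .NET APIs.

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;

public static class PoolingSketch
{
    // Hypothetical helper: small enough that asking the JIT to inline it is cheap.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static byte Average(byte r, byte g, byte b) => (byte)((r + g + b) / 3);

    // Hypothetical stand-in for an encoder: reads RGB pixels through a
    // ReadOnlySpan and writes one byte per pixel into a rented scratch buffer.
    public static byte[] Process(ReadOnlySpan<byte> rgb)
    {
        int pixelCount = rgb.Length / 3;

        // Rent may hand back a larger array than requested,
        // so only the first pixelCount bytes are treated as valid.
        byte[] rented = ArrayPool<byte>.Shared.Rent(pixelCount);
        try
        {
            Span<byte> scratch = rented.AsSpan(0, pixelCount);
            for (int i = 0; i < pixelCount; i++)
            {
                int offset = i * 3;
                scratch[i] = Average(rgb[offset], rgb[offset + 1], rgb[offset + 2]);
            }

            // Only the right-sized result is allocated; the scratch buffer is reused.
            return scratch.ToArray();
        }
        finally
        {
            // Always hand the buffer back, even if processing throws.
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

The `try`/`finally` matters here: a rented array that is never returned is not a leak in the strict sense, but it defeats the point of pooling.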

Minor changes

Split the RGBA and RGB code paths so that there are no redundant channel checks inside the loop, which reduces branching (and cuts down on some CPU work). This gives about a 5-7% improvement, but it massively increases the size of the code and introduces some duplication, so I can understand not wanting this part of the change.
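As a rough illustration of that split (a sketch only, not the PR's code), the channel count can be checked once and then dispatched to a loop specialised for 3 or 4 bytes per pixel. `ChannelSplitSketch`, `ToRgba`, `CopyRgba` and `ExpandRgb` are hypothetical names; the duplicated loop bodies show the trade-off mentioned above.

```csharp
using System;

public static class ChannelSplitSketch
{
    // Checks the channel count once, then runs a loop with no per-pixel branch.
    public static byte[] ToRgba(ReadOnlySpan<byte> pixels, int channels)
    {
        if (channels != 3 && channels != 4)
            throw new ArgumentOutOfRangeException(nameof(channels));

        int count = pixels.Length / channels;
        byte[] output = new byte[count * 4];

        if (channels == 4)
            CopyRgba(pixels, output, count);
        else
            ExpandRgb(pixels, output, count);

        return output;
    }

    // 4-channel loop: alpha comes from the source.
    private static void CopyRgba(ReadOnlySpan<byte> src, Span<byte> dst, int count)
    {
        for (int i = 0; i < count; i++)
        {
            int o = i * 4;
            dst[o] = src[o];
            dst[o + 1] = src[o + 1];
            dst[o + 2] = src[o + 2];
            dst[o + 3] = src[o + 3];
        }
    }

    // 3-channel loop: near-duplicate of the above, with alpha fixed to 255.
    private static void ExpandRgb(ReadOnlySpan<byte> src, Span<byte> dst, int count)
    {
        for (int i = 0; i < count; i++)
        {
            int s = i * 3;
            int d = i * 4;
            dst[d] = src[s];
            dst[d + 1] = src[s + 1];
            dst[d + 2] = src[s + 2];
            dst[d + 3] = 255;
        }
    }
}
```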

Dec 28 '21 04:12 crener

Using the two benchmarks in the PR I submitted ( https://github.com/NUlliiON/QoiSharp/pull/8 ):

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
QoiEncoding	5.887 ms	0.0168 ms	0.0149 ms	101.5625	101.5625	101.5625	729 KB

and

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
QoiDecoding	4.049 ms	0.0175 ms	0.0155 ms	203.1250	203.1250	203.1250	2 MB

The WIP perf improvements in my various PRs were combined into this: https://github.com/NUlliiON/QoiSharp/pull/7#issuecomment-1002782226 and offer faster encoding and decoding as compared to this PR (for the specific image in the benchmark).

There's definitely some stuff worth integrating from both your PR and my PR. I was hesitant about splitting into a Decode3 and Decode4 method. However, if the maintainer is comfortable with that, we could investigate iterating over the byte[] using MemoryMarshal.Cast<byte, RGB> or MemoryMarshal.Cast<byte, RGBA> (where the structs are: struct RGB { public byte r, g, b; } and struct RGBA { public byte r, g, b, a; }). There are some other tricks that could be investigated too :)
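For reference, a minimal sketch of that cast idea, assuming the struct definitions quoted above. The reinterpreting cast in .NET lives on `System.Runtime.InteropServices.MemoryMarshal`; `CastSketch`, `SumRed` and `CountOpaque` are hypothetical helpers for illustration, and the explicit `StructLayout` attribute is an added precaution, not something from either PR.

```csharp
using System;
using System.Runtime.InteropServices;

// Pack = 1 makes the intended 3-byte and 4-byte layouts explicit.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct RGB { public byte r, g, b; }

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct RGBA { public byte r, g, b, a; }

public static class CastSketch
{
    // Views a byte buffer as RGB pixels without copying it.
    public static int SumRed(byte[] data)
    {
        ReadOnlySpan<byte> bytes = data;
        // Any trailing bytes that do not form a whole pixel are dropped by Cast.
        ReadOnlySpan<RGB> pixels = MemoryMarshal.Cast<byte, RGB>(bytes);

        int total = 0;
        foreach (RGB pixel in pixels)
            total += pixel.r;
        return total;
    }

    // Same idea for 4-channel data: counts fully opaque pixels.
    public static int CountOpaque(byte[] data)
    {
        ReadOnlySpan<byte> bytes = data;
        ReadOnlySpan<RGBA> pixels = MemoryMarshal.Cast<byte, RGBA>(bytes);

        int opaque = 0;
        foreach (RGBA pixel in pixels)
            if (pixel.a == 255) opaque++;
        return opaque;
    }
}
```

Because the pixel type is fixed at compile time, this approach pairs naturally with separate 3-channel and 4-channel methods.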

Dec 29 '21 21:12 alanmcgovern

Yeah, the RGB split is a little extreme, but when I tried it I got a good 10% improvement in decode, which made me do the same in encode, where it gave about 5%.

Dec 30 '21 12:12 crener