30% performance increase
Hey, saw there was a .NET port of QOI and got curious about the newer performance features available in .NET Standard. Here is what I came up with:
Encode:
| Method | CurrentPath | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| OriginalEncode | Large.jpg | 201,046.6 us | 722.04 us | 675.39 us | 1.00 | - | - | - | 74,109 KB |
| OptimisedEncode | Large.jpg | 139,210.1 us | 332.80 us | 295.02 us | 0.69 | - | - | - | 18,885 KB |
| OriginalEncode | Medium.jpg | 11,649.1 us | 47.91 us | 42.47 us | 1.00 | 31.2500 | 31.2500 | 31.2500 | 4,385 KB |
| OptimisedEncode | Medium.jpg | 7,930.6 us | 73.48 us | 65.14 us | 0.68 | - | - | - | 1,749 KB |
| OriginalEncode | Small.jpg | 1,054.2 us | 1.55 us | 1.29 us | 1.00 | 1.9531 | 1.9531 | 1.9531 | 422 KB |
| OptimisedEncode | Small.jpg | 724.7 us | 3.00 us | 2.81 us | 0.69 | - | - | - | 129 KB |
Decode:
| Method | CurrentPath | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| OriginalDecode | Large.qoi | 127,637.5 us | 576.69 us | 539.43 us | 1.00 | 250.0000 | 250.0000 | 250.0000 | 55,285 KB |
| OptimisedDecode | Large.qoi | 123,073.7 us | 222.35 us | 197.11 us | 0.96 | 250.0000 | 250.0000 | 250.0000 | 55,284 KB |
| OriginalDecode | Medium.qoi | 6,381.4 us | 26.74 us | 22.33 us | 1.00 | 46.8750 | 46.8750 | 46.8750 | 2,637 KB |
| OptimisedDecode | Medium.qoi | 5,763.2 us | 41.81 us | 39.11 us | 0.90 | 70.3125 | 70.3125 | 70.3125 | 2,637 KB |
| OriginalDecode | Small.qoi | 636.7 us | 1.35 us | 1.19 us | 1.00 | 5.8594 | 5.8594 | 5.8594 | 293 KB |
| OptimisedDecode | Small.qoi | 630.1 us | 9.91 us | 9.27 us | 0.99 | 7.8125 | 7.8125 | 7.8125 | 293 KB |
I think potential users would quite like these changes: going by the tables above, encoding is roughly 30% faster and allocates far less, and decoding is up to 10% faster.
Main changes
- Minimize GC by using `ArrayPool` to rent a temporary array when encoding/decoding, so that nothing needs to be allocated and thrown away.
- Use `Span` for read-only array access, as it has lower overhead (funnily enough, not every place that used an array benefited from `Span`).
- Some aggressive inlining means that small methods don't add a new stack frame, which improves performance at the expense of assembly size (but the number of usages in this case doesn't make a huge difference to size).
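To make the three points above concrete, here is a minimal sketch, not the code from the PR. `PoolAndSpanSketch`, `ColorHash` and `RunLengthEncode` are hypothetical names, and the run-length encoder is a toy stand-in for the real QOI encoder. The hash formula is the one from the QOI specification. The scratch buffer is rented from `ArrayPool<byte>.Shared`, the input is read through a `ReadOnlySpan<byte>`, and the small helper is marked with `MethodImplOptions.AggressiveInlining`.

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;

// Hypothetical illustration only; not QoiSharp's actual implementation.
public static class PoolAndSpanSketch
{
    // Small hot-path helper. The attribute asks the JIT to inline it,
    // so calls from a tight loop avoid the call overhead.
    // Formula taken from the QOI spec: (r*3 + g*5 + b*7 + a*11) % 64.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int ColorHash(byte r, byte g, byte b, byte a) =>
        (r * 3 + g * 5 + b * 7 + a * 11) % 64;

    // Toy run-length encoder: emits (count, value) pairs.
    // It stands in for any encoder that needs a worst-case-sized scratch buffer.
    public static byte[] RunLengthEncode(ReadOnlySpan<byte> input)
    {
        // Worst case: every input byte becomes a two-byte pair.
        // Rent may hand back a larger array than requested.
        byte[] scratch = ArrayPool<byte>.Shared.Rent(input.Length * 2);
        try
        {
            int written = 0;
            int i = 0;
            while (i < input.Length)
            {
                byte value = input[i];
                int run = 1;
                while (i + run < input.Length && input[i + run] == value && run < 255)
                {
                    run++;
                }

                scratch[written++] = (byte)run;
                scratch[written++] = value;
                i += run;
            }

            // Only the exactly-sized result is allocated on the heap.
            return scratch.AsSpan(0, written).ToArray();
        }
        finally
        {
            // Always hand the buffer back, even if encoding throws.
            ArrayPool<byte>.Shared.Return(scratch);
        }
    }
}
```

The rented array can be longer than what was asked for, which is why the sketch tracks `written` itself instead of relying on `scratch.Length`.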
Minor changes
- Split the RGBA and RGB code so that there are no redundant checks inside the loop, which reduces branching (and cuts down on some CPU work). This gives about a 5-7% improvement, but it massively increases the size of the code and duplicates some of it, so I can understand not wanting this part of the change.
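A rough sketch of what splitting the RGB and RGBA paths means, assuming a simplified loop that only counts pixels identical to the previous one. `ChannelSplitSketch` and its methods are hypothetical and not taken from the PR. The first method checks the channel count for every pixel, while the split version checks it once and then runs a specialised loop. The duplicated loop body is exactly the trade-off mentioned above.

```csharp
using System;

// Hypothetical illustration only; not QoiSharp's actual implementation.
public static class ChannelSplitSketch
{
    // Single loop: the channel-count check runs once per pixel.
    public static int CountRepeatsBranchy(ReadOnlySpan<byte> pixels, int channels)
    {
        int repeats = 0;
        byte pr = 0, pg = 0, pb = 0, pa = 255; // previous pixel starts as opaque black
        for (int i = 0; i + channels <= pixels.Length; i += channels)
        {
            byte r = pixels[i], g = pixels[i + 1], b = pixels[i + 2];
            byte a = channels == 4 ? pixels[i + 3] : (byte)255; // per-pixel branch
            if (r == pr && g == pg && b == pb && a == pa) repeats++;
            pr = r; pg = g; pb = b; pa = a;
        }
        return repeats;
    }

    // Split version: the check happens once, outside the loops.
    public static int CountRepeatsSplit(ReadOnlySpan<byte> pixels, int channels) =>
        channels == 4 ? CountRepeats4(pixels) : CountRepeats3(pixels);

    // RGB-only loop: alpha never changes, so it is not read or compared.
    private static int CountRepeats3(ReadOnlySpan<byte> pixels)
    {
        int repeats = 0;
        byte pr = 0, pg = 0, pb = 0;
        for (int i = 0; i + 3 <= pixels.Length; i += 3)
        {
            byte r = pixels[i], g = pixels[i + 1], b = pixels[i + 2];
            if (r == pr && g == pg && b == pb) repeats++;
            pr = r; pg = g; pb = b;
        }
        return repeats;
    }

    // RGBA-only loop: alpha is always read, with no channel check.
    private static int CountRepeats4(ReadOnlySpan<byte> pixels)
    {
        int repeats = 0;
        byte pr = 0, pg = 0, pb = 0, pa = 255;
        for (int i = 0; i + 4 <= pixels.Length; i += 4)
        {
            byte r = pixels[i], g = pixels[i + 1], b = pixels[i + 2], a = pixels[i + 3];
            if (r == pr && g == pg && b == pb && a == pa) repeats++;
            pr = r; pg = g; pb = b; pa = a;
        }
        return repeats;
    }
}
```

Whether the split pays off depends on the JIT and the data, so it is worth benchmarking both forms rather than assuming the gain.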
Using the two benchmarks in the PR I submitted ( https://github.com/NUlliiON/QoiSharp/pull/8 ):
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|
| QoiEncoding | 5.887 ms | 0.0168 ms | 0.0149 ms | 101.5625 | 101.5625 | 101.5625 | 729 KB |
and
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|
| QoiDecoding | 4.049 ms | 0.0175 ms | 0.0155 ms | 203.1250 | 203.1250 | 203.1250 | 2 MB |
The WIP perf improvements in my various PRs were combined into this: https://github.com/NUlliiON/QoiSharp/pull/7#issuecomment-1002782226 and offer faster encoding and decoding as compared to this PR (for the specific image in the benchmark).
There's definitely some stuff worth integrating from both your PR and my PR. I was hesitant about splitting into a Decode3 and Decode4 method. However, if the maintainer is comfortable with that, we could investigate iterating over the byte[] using MemoryMarshal.Cast<byte, RGB> or MemoryMarshal.Cast<byte, RGBA> (where the structs are: struct RGB { public byte r, g, b; } and struct RGBA { public byte r, g, b, a; }). There are some other tricks that could be investigated too :)
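For reference, a minimal sketch of the struct-cast idea, assuming the two structs described above. The .NET API for reinterpreting a span is `MemoryMarshal.Cast<TFrom, TTo>` in `System.Runtime.InteropServices`. `PixelCastSketch`, `CountOpaque` and `SumRed` are hypothetical helpers that exist only to show the iteration, and the explicit `Pack = 1` layout is an extra precaution I added.

```csharp
using System;
using System.Runtime.InteropServices;

// Byte-only structs; Pack = 1 makes the 3- and 4-byte sizes explicit.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct RGB { public byte r, g, b; }

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct RGBA { public byte r, g, b, a; }

// Hypothetical illustration only; not QoiSharp's actual implementation.
public static class PixelCastSketch
{
    // Views the bytes as RGBA pixels without copying and counts opaque ones.
    public static int CountOpaque(ReadOnlySpan<byte> bytes)
    {
        // Trailing bytes that do not fill a whole pixel are dropped by the cast.
        ReadOnlySpan<RGBA> pixels = MemoryMarshal.Cast<byte, RGBA>(bytes);
        int opaque = 0;
        foreach (RGBA p in pixels)
        {
            if (p.a == 255) opaque++;
        }
        return opaque;
    }

    // Views the bytes as RGB pixels without copying and sums the red channel.
    public static int SumRed(ReadOnlySpan<byte> bytes)
    {
        ReadOnlySpan<RGB> pixels = MemoryMarshal.Cast<byte, RGB>(bytes);
        int sum = 0;
        foreach (RGB p in pixels)
        {
            sum += p.r;
        }
        return sum;
    }
}
```

The cast only reinterprets the existing memory, so a decoder could index whole pixels instead of juggling byte offsets. Whether that is actually faster here would need a benchmark.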
Yeah, the RGB split is a little extreme, but when I did it I got a good 10% improvement in decode, which made me do it in encode as well, where it gave about 5%.