qoi2-bikeshed
Diff from average instead of previous pixel
For the luma operations, I tried comparing against the average of the previous pixel and the pixel above the current one, instead of against the previous pixel alone. It resulted in significantly better compression. Together with the other tags I chose, I was able to get the compression numbers below.
# Grand total for images
decode ms encode ms decode mpps encode mpps size kb rate
qoi: 2.6 3.0 177.46 155.06 463 28.2%
qoi2avg: 2.9 3.7 162.71 125.88 405 24.7%
# Grand total for images-lance
decode ms encode ms decode mpps encode mpps size kb rate
qoi: 17.2 15.1 236.14 268.34 2109 13.6%
qoi2avg 17.2 18.7 236.05 217.03 1709 11.0%
Impressive compression numbers from a relatively simple change. The opcodes changed slightly, but roughly speaking, is most of the drop from the 420 kb of your previous variant to the 405 kb here due to diffing against the average?
The main downside of this is that it doesn't work well with a streaming encoder/decoder.
The other opcode changes are there only for better handling of images with an alpha channel, at the cost of turning luma686 into luma676, which made the compression a bit worse for 'images' and better for 'images-lance'. I believe 'images' was at 422 kb with the changes for the alpha channel and dropped to 405 kb by taking the diff from the average instead of the previous pixel. So it was indeed a significant drop from a relatively simple change.
The main downside would indeed be for a streaming encoder/decoder, because you have to revisit previous pixels. When encoding/decoding from buffer to buffer it's easy because previous pixels are easily accessible. For streaming you would have to keep a cache of the previous line of pixels.
A streaming encoder's memory requirement depending on the width of the image is the biggest issue IMO. I wonder if there's a way to have a well-performing fixed-size previous-line cache. Alternatively, it would probably be good enough to cap the previous-line cache at something like 4096 pixels; that way 99% of input benefits fully from averaging, and the rest at least benefits from averaging for the first 4096 pixels in every row (OTOH, it may not work in practice).
There was also the tiling proposal - maybe reading the whole image as a sequence of tiles instead of a sequence of pixels would do the trick?
that would work, but it is strictly harder for streaming encoders.
> I wonder if there's a way to have a well-performing fixed-size previous-line cache. Alternatively it would probably be good enough to set a max size of the previous-line cache to something like 4096 pixels, that way 99% of input benefits fully from averaging and the rest can at least benefit from averaging for the first 4096 pixels in every row (OTOH, it may not work in practice).
Since you need the pixel right above the current one, you have to cache a full width of pixels. The speed would probably be acceptable, since you could use a FIFO / ring buffer to cache the pixels. You would need to allocate the memory for that, so for an image with a resolution of 20,000,000 x 2 pixels it wouldn't be very useful, but for the majority of images it would work fine. An 8K image (7680x4320 pixels) would require a cache of 22.5 KiB (caching the alpha channel isn't required).
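A minimal sketch of such a previous-row cache (the `row_cache` name and layout are illustrative, not from any real implementation): one full image width of RGB values is kept, and each slot is read as the "above" pixel and then overwritten with the current pixel.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative previous-row cache for a streaming codec: holds one image
 * width of RGB pixels (3 bytes each; alpha isn't needed for averaging). */
typedef struct {
    uint8_t *row;     /* width * 3 bytes */
    uint32_t width;
    uint32_t x;       /* current column */
} row_cache;

static int row_cache_init(row_cache *c, uint32_t width) {
    c->row = (uint8_t *)calloc(width, 3);  /* first row sees 'above' = 0 */
    c->width = width;
    c->x = 0;
    return c->row != NULL;
}

/* Read the pixel directly above the current position into 'above', then
 * store the current pixel in its place and advance one column. */
static void row_cache_step(row_cache *c, const uint8_t cur[3], uint8_t above[3]) {
    memcpy(above, c->row + 3 * c->x, 3);
    memcpy(c->row + 3 * c->x, cur, 3);
    c->x = (c->x + 1) % c->width;
}
```

For a 7680-pixel-wide image this is 7680 * 3 = 23040 bytes, i.e. the 22.5 KiB mentioned above.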
> that would work, but it is strictly harder for streaming encoders.
Is it? I mean, the simplest solution (ignoring tile boundaries for optimization/compression and compressing each tile individually) could be viewed as a list of small unrelated pictures. (Of course, some operations like dithering won't work this way without artifacts, but that's not what's being discussed in this thread, right?)
That sounds darn streaming-oriented and quite easy to implement (not that I'd have time to try it out now - just thinking aloud).
An 8K image ( 7680x4320 pixels ) would require a cache of 22.5 KiB (caching alpha channel isn't required).
My proposal fixes the cache at some size (let's say 8192 pixels, which sounds reasonable as an upper bound on expected input width). This would mean the 7680x4320 image uses an averaged reference everywhere, as would a 7680x20,000,000 image. The 20,000,000 x 2 image would use averaging for the first 8192 pixels in each row and the previous pixel for the rest, where it has no upper reference. If it works, it would allow averaging for 99% of input without screwing streaming encoders or forcing high memory requirements. The main drawback is introducing a condition at the boundary for high-width input. The branch predictor can probably be leaned on instead of complicating the implementation: for 99% of input, a branch that only ever goes one way should be effectively free(?).
> The 20,000,000 x 2 image would use averaging for the first 8192 pixels in each row, previous pixel for the rest where it has no upper reference. If it works it would allow averaging for 99% of input without screwing streaming encoders or forcing high memory requirements. The main drawback is introducing a condition for the boundary of high-width input.
That should work. With such an approach, it might make sense to use an outer loop for the height with an inner loop for the width instead of a single loop over all pixels.
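Sketching that loop structure with a capped cache (the 8192-pixel cap and the single-channel `predict` helper are illustrative assumptions, not from any posted code): the predictor only consults the row above while the column is within the cache, and otherwise falls back to the left pixel.

```c
#include <stdint.h>

#define CACHE_MAX 8192  /* assumed cap on the previous-row cache */

/* Illustrative single-channel predictor: average of left and above while
 * the column fits in the cache, otherwise fall back to the left pixel. */
static uint8_t predict(const uint8_t *img, uint32_t w, uint32_t x, uint32_t y) {
    uint8_t left = (x > 0) ? img[y * w + x - 1] : 0;
    if (y == 0 || x >= CACHE_MAX)
        return left;                            /* no cached pixel above */
    uint8_t above = img[(y - 1) * w + x];
    return (uint8_t)((left + above + 1) >> 1);  /* rounded-up average */
}

/* Outer loop over rows, inner loop over columns, as suggested above. */
static void encode_sketch(const uint8_t *img, uint32_t w, uint32_t h) {
    for (uint32_t y = 0; y < h; y++)
        for (uint32_t x = 0; x < w; x++) {
            uint8_t pred = predict(img, w, x, y);
            (void)pred;  /* a real encoder would emit img[y*w+x] - pred */
        }
}
```

For 99%-of-inputs width, the `x >= CACHE_MAX` branch never fires, so the branch predictor should make it essentially free.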
I thought a bit more about the rounding of the average. In my previous code I used `(px1 + px2) >> 1`, so a .5 value was rounded down. Since the delta values can adjust down by 1 more than up, it makes more sense to round the normal way: `(px1 + px2 + 1) >> 1`, so a .5 value is rounded up. By making this small adjustment, 'images' drops from 405 kb to 404 and 'images-lance' from 1709 to 1706. Not much of a difference, but worth changing the rounding I think.
qoi2avg2.h.txt
Here are my numbers for 'regular' qoi2avg2.h.txt (which I've called qoi-reg-avg2). I also wrapped LZ4 around it (called qoi-lz4-avg2). For comparison, qoi-mas-c04 (commit c04a975 from the phoboslab/master branch, 2021-12-24) and qoi-lz4-d28 are from https://github.com/nigeltao/qoi2-bikeshed/issues/28#issuecomment-1002079802

qoi-lz4-avg2 compresses better than libpng on the images test suite (with 2.5x faster decode, 20x faster encode)! Performance could probably be optimized, too.
decode ms encode ms decode mpps encode mpps size kb rate
## icon_512/apps-preferences-desktop-locale.png size: 512x512
libpng: 5.1 23.6 51.08 11.09 25 2.5%
qoi-mas-c04: 1.9 1.8 141.22 148.71 186 18.2%
qoi-reg-avg2: 2.1 2.4 127.80 110.18 171 16.8%
qoi-lz4-d28: 1.6 1.7 162.76 152.28 27 2.7%
qoi-lz4-avg2: 2.0 2.4 131.81 107.72 38 3.7%
## Total for images/icon_512
libpng: 4.9 33.7 52.98 7.78 51 5.0%
qoi-mas-c04: 1.4 1.4 182.31 182.77 85 8.4%
qoi-reg-avg2: 1.5 1.8 180.13 142.30 76 7.5%
qoi-lz4-d28: 0.9 1.6 289.36 163.47 50 4.9%
qoi-lz4-avg2: 1.5 2.0 175.10 128.33 55 5.4%
## Total for images/photo_kodak
libpng: 15.3 192.1 25.78 2.05 648 56.3%
qoi-mas-c04: 5.6 7.6 69.91 51.85 671 58.3%
qoi-reg-avg2: 7.3 9.6 53.84 41.08 600 52.2%
qoi-lz4-d28: 6.1 8.0 64.25 49.44 651 56.5%
qoi-lz4-avg2: 7.2 9.9 54.98 39.69 600 52.2%
## Total for images/screenshot_web
libpng: 92.6 1045.6 87.78 7.77 2402 7.6%
qoi-mas-c04: 50.1 47.1 162.16 172.58 2649 8.3%
qoi-reg-avg2: 51.2 63.1 158.55 128.84 2302 7.3%
qoi-lz4-d28: 30.6 55.9 265.65 145.24 2080 6.6%
qoi-lz4-avg2: 46.6 64.5 174.28 125.92 2070 6.5%
# Grand total for images
libpng: 13.5 153.7 34.44 3.02 398 24.2%
qoi-mas-c04: 4.7 5.6 99.25 82.46 463 28.2%
qoi-reg-avg2: 5.4 7.1 86.37 65.40 404 24.6%
qoi-lz4-d28: 4.2 6.0 110.38 77.11 416 25.4%
qoi-lz4-avg2: 5.4 7.5 85.54 61.86 391 23.8%
# Grand total for images-lance
libpng: 103.5 925.0 39.25 4.39 1395 9.0%
qoi-mas-c04: 35.2 32.5 115.26 124.91 2109 13.6%
qoi-reg-avg2: 37.7 40.5 107.59 100.39 1706 11.0%
qoi-lz4-d28: 28.7 36.2 141.62 112.13 1530 9.8%
qoi-lz4-avg2: 38.2 42.9 106.27 94.71 1590 10.2%
Compile with:

```shell
gcc -I ../stb qoibench.c -llz4 -lpng -O3
```
The Gamut library https://github.com/AuburnSounds/gamut provides both the QOI format and "QOIX", an evolving format which is currently qoi2avg.h from @wbd73 followed by LZ4 compression (i.e., it is qoi-lz4-avg2 without the rounding fix).
@wbd73 Would you disclose your real name in order to have proper copyright information?
> @wbd73 Would you disclose your real name in order to have proper copyright information?
The way you did it now is fine with me. Some ideas (luma232 with a different bias and using the average of two pixels as a prediction) came from me, but using a FIFO for the color index was someone else's idea.
My QOIR repo is a work-in-progress, but FYI when I change the predictor from (left) to ((1+left+above)/2),
- I can reproduce the slow-down in encode and decode speed, but
- I can claw almost all of it back by using SIMD SSE2 (encode, decode), but
- SIMD is obviously less portable.
If you use a union so you can access the four uint8 bytes of a pixel also as a 32-bit integer value, you can use SWAR (SIMD within a register). It's a bit of an in-between solution that is more portable.
```c
/* SWAR */
#define QOI_PAVGB(P0, P1) (((P0) | (P1)) - ((((P0) ^ (P1)) >> 1) & 0x7f7f7f7f))
#define QOI_PSUBB(P0, P1) ((((P0) | 0x80808080) - ((P1) & 0x7f7f7f7f)) ^ (((P0) ^ ~(P1)) & 0x80808080))
```

(Note the added outer parentheses around QOI_PSUBB: without them, `^` binds more loosely than operators in surrounding expressions and the macro can expand incorrectly.)
In this case, if P0 is one pixel represented as a 32-bit integer value and P1 is another pixel, QOI_PAVGB returns a 32-bit integer where each byte is the average of the corresponding bytes of the two pixels (rounded up, as preferred). QOI_PSUBB works in a similar way but returns the bytewise difference.