qoi2-bikeshed
Diff from average instead of previous pixel
For the luma operations, I tried comparing against the average of the previous pixel and the pixel above the current one, instead of against the previous pixel alone. It resulted in significantly better compression. Together with the other tags I chose, I was able to get the compression numbers below.
# Grand total for images
decode ms encode ms decode mpps encode mpps size kb rate
qoi: 2.6 3.0 177.46 155.06 463 28.2%
qoi2avg: 2.9 3.7 162.71 125.88 405 24.7%
# Grand total for images-lance
decode ms encode ms decode mpps encode mpps size kb rate
qoi: 17.2 15.1 236.14 268.34 2109 13.6%
qoi2avg 17.2 18.7 236.05 217.03 1709 11.0%
Impressive compression numbers from a relatively simple change. The opcodes changed slightly, but roughly speaking, is most of the drop from the 420 kb of your previous variant to the 405 kb here due to diffing against the average?
The main downside of this is that it doesn't work well with a streaming encoder/decoder.
The other opcode changes are there only for better handling of images with an alpha channel, at the cost of turning luma686 into luma676, which made the compression a bit worse for 'images' and better for 'images-lance'. I believe 'images' was at 422 kb with the changes for the alpha channel and dropped to 405 kb by taking the diff from the average instead of the previous pixel. So it was indeed a significant drop from a relatively simple change.
The main downside would indeed be for a streaming encoder/decoder, because you have to revisit previous pixels. When encoding/decoding from buffer to buffer it's easy because previous pixels are easily accessible. For streaming you would have to keep a cache of the previous line of pixels.
A streaming encoder's memory requirement depending on the width of the image is the biggest issue IMO. I wonder if there's a way to have a well-performing fixed-size previous-line cache. Alternatively, it would probably be good enough to cap the previous-line cache at something like 4096 pixels; that way 99% of input benefits fully from averaging, and the rest at least benefits from averaging for the first 4096 pixels in every row (OTOH, it may not work in practice).
There was also the tiling proposal - maybe reading the whole image as a sequence of tiles instead of a sequence of pixels would do the trick?
that would work, but it is strictly harder for streaming encoders.
> I wonder if there's a way to have a well-performing fixed-size previous-line cache. Alternatively it would probably be good enough to set a max size of the previous-line cache to something like 4096 pixels, that way 99% of input benefits fully from averaging and the rest can at least benefit from averaging for the first 4096 pixels in every row (OTOH, it may not work in practice).
Since you need the pixel right above the current one, you have to cache a full width of pixels. The speed would probably be acceptable, since you could use a FIFO / ring buffer to cache the pixels. You would need to allocate the memory for that, so for an image with a resolution of 20,000,000 x 2 pixels it wouldn't be very useful, but for the majority of images it would work fine. An 8K image (7680x4320 pixels) would require a cache of 22.5 KiB (caching the alpha channel isn't required).
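A minimal sketch of such a previous-row cache (the `row_cache` name and layout are illustrative, not from any real implementation): one full image width of RGB values is kept, and each slot is read as the "above" pixel and then overwritten with the current pixel.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative previous-row cache for a streaming codec: holds one image
 * width of RGB pixels (3 bytes each; alpha isn't needed for averaging). */
typedef struct {
    uint8_t *row;     /* width * 3 bytes */
    uint32_t width;
    uint32_t x;       /* current column */
} row_cache;

static int row_cache_init(row_cache *c, uint32_t width) {
    c->row = (uint8_t *)calloc(width, 3);  /* first row sees 'above' = 0 */
    c->width = width;
    c->x = 0;
    return c->row != NULL;
}

/* Read the pixel directly above the current position into 'above', then
 * store the current pixel in its place and advance one column. */
static void row_cache_step(row_cache *c, const uint8_t cur[3], uint8_t above[3]) {
    memcpy(above, c->row + 3 * c->x, 3);
    memcpy(c->row + 3 * c->x, cur, 3);
    c->x = (c->x + 1) % c->width;
}
```

For a 7680-pixel-wide image this is 7680 * 3 = 23040 bytes, i.e. the 22.5 KiB mentioned above.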
> that would work, but it is strictly harder for streaming encoders.
Is it? I mean, the simplest solution (ignoring tile boundaries for optimization/compression and compressing each tile individually) could be viewed as a list of small unrelated pictures. (Of course, some operations like dithering won't work this way without artifacts, but that's not what's being discussed in this thread, right?)
That sounds darn streaming-oriented and quite easy to implement (not that I'd have time to try it out now - just thinking aloud).
An 8K image ( 7680x4320 pixels ) would require a cache of 22.5 KiB (caching alpha channel isn't required).
My proposal fixes the cache at some size (let's say 8192 pixels, which sounds reasonable as an upper bound on expected input width). This would mean the 7680x4320 image uses an averaged reference everywhere, as would a 7680x20,000,000 image. The 20,000,000 x 2 image would use averaging for the first 8192 pixels in each row and the previous pixel for the rest, where it has no upper reference. If it works, it would allow averaging for 99% of input without screwing streaming encoders or forcing high memory requirements. The main drawback is introducing a condition at the boundary for high-width input. The branch predictor can probably be leaned on instead of complicating the implementation: for 99% of input, a branch that only ever goes one way should be effectively free(?).
> The 20,000,000 x 2 image would use averaging for the first 8192 pixels in each row, previous pixel for the rest where it has no upper reference. If it works it would allow averaging for 99% of input without screwing streaming encoders or forcing high memory requirements. The main drawback is introducing a condition for the boundary of high-width input.
That should work. With such an approach, it might make sense to use an outer loop for the height with an inner loop for the width instead of a single loop over all pixels.
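Sketching that loop structure with a capped cache (the 8192-pixel cap and the single-channel `predict` helper are illustrative assumptions, not from any posted code): the predictor only consults the row above while the column is within the cache, and otherwise falls back to the left pixel.

```c
#include <stdint.h>

#define CACHE_MAX 8192  /* assumed cap on the previous-row cache */

/* Illustrative single-channel predictor: average of left and above while
 * the column fits in the cache, otherwise fall back to the left pixel. */
static uint8_t predict(const uint8_t *img, uint32_t w, uint32_t x, uint32_t y) {
    uint8_t left = (x > 0) ? img[y * w + x - 1] : 0;
    if (y == 0 || x >= CACHE_MAX)
        return left;                            /* no cached pixel above */
    uint8_t above = img[(y - 1) * w + x];
    return (uint8_t)((left + above + 1) >> 1);  /* rounded-up average */
}

/* Outer loop over rows, inner loop over columns, as suggested above. */
static void encode_sketch(const uint8_t *img, uint32_t w, uint32_t h) {
    for (uint32_t y = 0; y < h; y++)
        for (uint32_t x = 0; x < w; x++) {
            uint8_t pred = predict(img, w, x, y);
            (void)pred;  /* a real encoder would emit img[y*w+x] - pred */
        }
}
```

For 99%-of-inputs width, the `x >= CACHE_MAX` branch never fires, so the branch predictor should make it essentially free.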
I thought a bit more about the rounding of the average. In my previous code I used `(px1 + px2) >> 1`, so a .5 value was rounded down. Since the delta values can adjust down by 1 more than up, it makes more sense to round the normal way: `(px1 + px2 + 1) >> 1`, so a .5 value is rounded up. By making this small adjustment, 'images' drops from 405 kb to 404 and 'images-lance' from 1709 to 1706. Not much of a difference, but worth changing the rounding I think.
qoi2avg2.h.txt
Here are my numbers for 'regular' qoi2avg2.h.txt (which I've called qoi-reg-avg2). I also wrapped LZ4 around it (called qoi-lz4-avg2). For comparison, qoi-mas-c04 (commit c04a975 from the phoboslab/master branch, 2021-12-24) and qoi-lz4-d28 are from https://github.com/nigeltao/qoi2-bikeshed/issues/28#issuecomment-1002079802

qoi-lz4-avg2 compresses better than libpng on the images test suite (with 2.5x faster decode, 20x faster encode)! Performance could probably be optimized, too.
decode ms encode ms decode mpps encode mpps size kb rate
## icon_512/apps-preferences-desktop-locale.png size: 512x512
libpng: 5.1 23.6 51.08 11.09 25 2.5%
qoi-mas-c04: 1.9 1.8 141.22 148.71 186 18.2%
qoi-reg-avg2: 2.1 2.4 127.80 110.18 171 16.8%
qoi-lz4-d28: 1.6 1.7 162.76 152.28 27 2.7%
qoi-lz4-avg2: 2.0 2.4 131.81 107.72 38 3.7%
## Total for images/icon_512
libpng: 4.9 33.7 52.98 7.78 51 5.0%
qoi-mas-c04: 1.4 1.4 182.31 182.77 85 8.4%
qoi-reg-avg2: 1.5 1.8 180.13 142.30 76 7.5%
qoi-lz4-d28: 0.9 1.6 289.36 163.47 50 4.9%
qoi-lz4-avg2: 1.5 2.0 175.10 128.33 55 5.4%
## Total for images/photo_kodak
libpng: 15.3 192.1 25.78 2.05 648 56.3%
qoi-mas-c04: 5.6 7.6 69.91 51.85 671 58.3%
qoi-reg-avg2: 7.3 9.6 53.84 41.08 600 52.2%
qoi-lz4-d28: 6.1 8.0 64.25 49.44 651 56.5%
qoi-lz4-avg2: 7.2 9.9 54.98 39.69 600 52.2%
## Total for images/screenshot_web
libpng: 92.6 1045.6 87.78 7.77 2402 7.6%
qoi-mas-c04: 50.1 47.1 162.16 172.58 2649 8.3%
qoi-reg-avg2: 51.2 63.1 158.55 128.84 2302 7.3%
qoi-lz4-d28: 30.6 55.9 265.65 145.24 2080 6.6%
qoi-lz4-avg2: 46.6 64.5 174.28 125.92 2070 6.5%
# Grand total for images
libpng: 13.5 153.7 34.44 3.02 398 24.2%
qoi-mas-c04: 4.7 5.6 99.25 82.46 463 28.2%
qoi-reg-avg2: 5.4 7.1 86.37 65.40 404 24.6%
qoi-lz4-d28: 4.2 6.0 110.38 77.11 416 25.4%
qoi-lz4-avg2: 5.4 7.5 85.54 61.86 391 23.8%
# Grand total for images-lance
libpng: 103.5 925.0 39.25 4.39 1395 9.0%
qoi-mas-c04: 35.2 32.5 115.26 124.91 2109 13.6%
qoi-reg-avg2: 37.7 40.5 107.59 100.39 1706 11.0%
qoi-lz4-d28: 28.7 36.2 141.62 112.13 1530 9.8%
qoi-lz4-avg2: 38.2 42.9 106.27 94.71 1590 10.2%
Compile with:

```shell
gcc -I ../stb qoibench.c -llz4 -lpng -O3
```
The Gamut library https://github.com/AuburnSounds/gamut provides both the QOI format and "QOIX", an evolving format which is currently qoi2avg.h from @wbd73 followed by LZ4 compression (i.e., it is qoi-lz4-avg2 without the rounding fix).
@wbd73 Would you disclose your real name in order to have proper copyright information?
> @wbd73 Would you disclose your real name in order to have proper copyright information?
The way you did it now is fine with me. Some ideas (luma232 with a different bias and using the average of two pixels as a prediction) came from me, but using a FIFO for the color index was someone else's idea.
My QOIR repo is a work-in-progress, but FYI when I change the predictor from (left) to ((1+left+above)/2),
- I can reproduce the slow-down in encode and decode speed, but
- I can claw almost all of it back by using SIMD SSE2 (encode, decode), but
- SIMD is obviously less portable.
If you use a union so you can access the four uint8 bytes of a pixel also as a 32-bit integer value, you can use SWAR (SIMD within a register). It's a bit of an in-between solution that is more portable.
```c
/* SWAR */
#define QOI_PAVGB(P0, P1) (((P0) | (P1)) - ((((P0) ^ (P1)) >> 1) & 0x7f7f7f7f))
#define QOI_PSUBB(P0, P1) ((((P0) | 0x80808080) - ((P1) & 0x7f7f7f7f)) ^ (((P0) ^ ~(P1)) & 0x80808080))
```

(Note the added outer parentheses around QOI_PSUBB: without them, `^` binds more loosely than operators in surrounding expressions and the macro can expand incorrectly.)
In this case, if P0 is one pixel represented as a 32-bit integer value and P1 is another pixel, QOI_PAVGB returns a 32-bit integer where each byte is the average of the corresponding bytes of the two pixels (rounded up, as preferred). QOI_PSUBB works in a similar way but returns the bytewise difference.