Speed up the C encoder by up to 100x
All changes are split into independent commits; some of them are optional.
In addition to the performance improvements, there are a few other changes:
- Do not define `M_PI` in the sources; ensure it is defined in `math.h` (see the sketch after this list).
- Fixed the maximum number of components for the `blurhash_encoder` executable (in line with the `blurHashForPixels` function).
- Improved the `Makefile` to avoid heavy `encode_stb` recompilation on each change.
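The `M_PI` change is just a matter of asking `math.h` for the constant rather than redefining it. A minimal sketch of one common way to do that (the actual commit may instead pass a feature-test flag from the `Makefile`):

```c
/* Sketch only: request M_PI from <math.h> instead of redefining it in our sources.
   MSVC, and some strict ISO-C modes on other platforms, only expose M_PI when
   _USE_MATH_DEFINES (or an equivalent feature-test macro) is defined first. */
#ifndef _USE_MATH_DEFINES
#define _USE_MATH_DEFINES
#endif
#include <math.h>
```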
Benchmarks are in the comments below.
~~I've also implemented SSE and NEON optimizations in a separate branch.~~ The last optimization, unrolling the loop in multiplyBasisFunction, actually works better, since it lets any compiler autovectorize the code effectively (a sketch of the unrolling idea follows the tables below). Here are benchmarks for a 2000 × 1334 JPEG image on different systems; the column pairs are times in milliseconds for 6×4 and 9×9 components:
Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
| Optimization | GCC 13.2.1, 6×4 | GCC 13.2.1, 9×9 | Clang 17.0.6, 6×4 | Clang 17.0.6, 9×9 |
|---|---|---|---|---|
| Master | 3181 ms | 11844 ms | 3154 ms | 11124 ms |
| sRGBToLinear_cache | 381 | 1507 | 451 | 1633 |
| cosX cache | 82 | 339 | 88 | 270 |
| Single pass | 58 | 177 | 62 | 207 |
| ~~SSE~~ (obsolete) | 39 | 114 | 42 | 144 |
| Unroll 4x | 30 | 80 | 32 | 85 |
Apple M1 Pro
| Optimization | GCC 13.2.1, 6×4 | GCC 13.2.1, 9×9 | Clang 17.0.6, 6×4 | Clang 17.0.6, 9×9 | Clang 14.0.3, 6×4 | Clang 14.0.3, 9×9 |
|---|---|---|---|---|---|---|
| Master | 1177 ms | 4076 ms | 1156 ms | 4005 ms | 1268 ms | 4302 ms |
| sRGBToLinear_cache | 212 | 826 | 216 | 839 | 186 | 653 |
| cosX cache | 44 | 150 | 80 | 271 | 81 | 271 |
| Single pass | 20 | 62 | 32 | 57 | 29 | 70 |
| ~~NEON~~ (obsolete) | 27 | 87 | 25 | 80 | 25 | 80 |
| Unroll 4x | 16 | 49 | 15 | 43 | 15 | 42 |
* The M1 Pro results were corrected; the previously posted numbers were affected by a bug.
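To make the table rows concrete, here is a minimal sketch of how the `sRGBToLinear_cache`, `cosX cache` and `Unroll 4x` steps fit together. The names and structure below are illustrative only (the real patch also folds everything into a single pass and handles all three channels and all components at once); it computes the unnormalised red-channel sum for one basis function, using the `blurHashForPixels` RGB888 layout:

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/* 256-entry table replacing the per-pixel powf() call inside sRGBToLinear(). */
static float sRGBToLinear_cache[256];

static void initSRGBToLinearCache(void) {
	for(int i = 0; i < 256; i++) {
		float v = i / 255.0f;
		sRGBToLinear_cache[i] = v <= 0.04045f ? v / 12.92f : powf((v + 0.055f) / 1.055f, 2.4f);
	}
}

/* Unnormalised red-channel sum for one (xComponent, yComponent) basis function.
   cosX is computed once per component instead of once per pixel, and the inner
   loop keeps four independent accumulators so the compiler can autovectorize it. */
static float basisFactorRed(int xComponent, int yComponent, int width, int height,
                            const uint8_t *rgb, size_t bytesPerRow) {
	float *cosX = malloc(sizeof(float) * width);
	for(int x = 0; x < width; x++)
		cosX[x] = cosf((float)M_PI * xComponent * x / width);

	float sum = 0;
	for(int y = 0; y < height; y++) {
		float cosY = cosf((float)M_PI * yComponent * y / height);
		const uint8_t *row = rgb + y * bytesPerRow;
		float acc[4] = {0, 0, 0, 0};
		int x = 0;
		for(; x + 4 <= width; x += 4)           /* "Unroll 4x" */
			for(int k = 0; k < 4; k++)
				acc[k] += cosX[x + k] * sRGBToLinear_cache[row[3 * (x + k)]];
		for(; x < width; x++)                    /* leftover pixels */
			acc[0] += cosX[x] * sRGBToLinear_cache[row[3 * x]];
		sum += cosY * (acc[0] + acc[1] + acc[2] + acc[3]);
	}
	free(cosX);
	return sum;
}
```

With the cached tables in place, the unrolled loop is plain scalar C; the speedup comes from the compiler vectorizing the four independent accumulators, which is the effect the obsolete hand-written SSE/NEON branches tried to achieve manually.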
@DagAgren Are you interested in these improvements?
I also improved decoder performance about 14× using the same techniques: caching cos values and linearTosRGB values, and unrolling loops. This raises decoding throughput from 6 Mpx/s to 86 Mpx/s on M1.
Before (master):
$ touch decode.c && make blurhash_decoder && ./blurhash_decoder "W7E-z7oyM{8xM{wKwdMepHrE%LV[OVV@BBS\$r@NaR7OrRQNaMKXm" 640 480 _out.png
Time per 30 execution: 49.532 ms
After (optimized):
$ touch decode.c && make blurhash_decoder && ./blurhash_decoder "W7E-z7oyM{8xM{wKwdMepHrE%LV[OVV@BBS\$r@NaR7OrRQNaMKXm" 640 480 _out.png
Time per 30 execution: 3.573 ms
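(For scale: 640 × 480 = 307,200 pixels, so 49.532 ms per decode is roughly 6.2 Mpx/s and 3.573 ms is roughly 86 Mpx/s, which is where the numbers above come from.)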
This also introduces a very minor change in the output: nothing that could be noticed by the human eye, just slightly different binary output.
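For context, one natural source of such a sub-perceptual difference is replacing the exact `linearTosRGB` computation with a quantized lookup table on the decode path. A minimal sketch of that idea, assuming a 4096-entry table (names and table size here are illustrative, not the exact patch):

```c
#include <math.h>
#include <stdint.h>

#define LINEAR_TO_SRGB_STEPS 4096

static uint8_t linearTosRGB_cache[LINEAR_TO_SRGB_STEPS];

static void initLinearTosRGBCache(void) {
	for(int i = 0; i < LINEAR_TO_SRGB_STEPS; i++) {
		float v = (float)i / (LINEAR_TO_SRGB_STEPS - 1);
		float s = v <= 0.0031308f ? v * 12.92f : 1.055f * powf(v, 1.0f / 2.4f) - 0.055f;
		linearTosRGB_cache[i] = (uint8_t)(s * 255.0f + 0.5f);
	}
}

/* Quantizing the linear value to one of 4096 steps is what can flip the odd
   output byte by one compared to calling the exact formula per pixel. */
static uint8_t linearTosRGBFast(float value) {
	if(value <= 0) return 0;
	if(value >= 1) return 255;
	return linearTosRGB_cache[(int)(value * (LINEAR_TO_SRGB_STEPS - 1) + 0.5f)];
}
```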
The method I use to measure performance is the following:
diff --git forkSrcPrefix/C/encode_stb.c forkDstPrefix/C/encode_stb.c
index 811ca00006b45eaa829bfd267904ac0d0c647884..a95c6a2ff96ee7cdaa9d1b35ef28b063161cf01d 100644
--- forkSrcPrefix/C/encode_stb.c
+++ forkDstPrefix/C/encode_stb.c
@@ -4,6 +4,7 @@
#include "stb_image.h"
#include <stdio.h>
+#include <time.h>
const char *blurHashForFile(int xComponents, int yComponents,const char *filename);
@@ -38,6 +39,14 @@ const char *blurHashForFile(int xComponents, int yComponents,const char *filenam
const char *hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
+ #define TIMES 30
+ clock_t start = clock();
+ for (int i = 0; i < TIMES; i++) {
+ hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
+ }
+ double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+ printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
stbi_image_free(data);
return hash;
diff --git forkSrcPrefix/C/decode_stb.c forkDstPrefix/C/decode_stb.c
index dab164e1eaf1a7199a751a5e13f6da7099027bd2..3514f53e6f91dc41253429ea07e594893d536598 100644
--- forkSrcPrefix/C/decode_stb.c
+++ forkDstPrefix/C/decode_stb.c
@@ -3,6 +3,8 @@
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_writer.h"
+#include <time.h>
+
int main(int argc, char **argv) {
if(argc < 5) {
fprintf(stderr, "Usage: %s hash width height output_file [punch]\n", argv[0]);
@@ -34,6 +36,15 @@ int main(int argc, char **argv) {
freePixelArray(bytes);
+ #define TIMES 30
+ clock_t start = clock();
+ for (int i = 0; i < TIMES; i++) {
+ uint8_t * tmpbytes = decode(hash, width, height, punch, nChannels);
+ freePixelArray(tmpbytes);
+ }
+ double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+ printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
fprintf(stdout, "Decoded blurhash successfully, wrote PNG file %s\n", output_file);
return 0;
}
@DagAgren How can I earn your attention?
@DagAgren Please note that we would be very grateful for the optimization of the algorithm.
This is a breakthrough for this library. Why can't we merge it, @DagAgren?
Sorry, I did not see this earlier. However, this code is intentionally written to be simple rather than performant, because it is meant as a reference implementation that can be ported as easily as possible to other platforms.
Also, it should not need high performance. You should not run it on a full-sized image, but instead first scale the image down to a much smaller size, such as 32x32, and run it on that. This is mentioned in the documentation. Running it on a full-scale image is not useful, as it throws away all that detail anyway.
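For reference, the recommended workflow is easy to sketch: downscale first, then hash the small buffer. The box downscaler below is only an illustration (any real resizer works just as well), and it assumes tightly packed RGB888 input at least 32 × 32 pixels in size; only the `blurHashForPixels` prototype comes from the library itself.

```c
#include <stdint.h>
#include <stdlib.h>

const char *blurHashForPixels(int xComponents, int yComponents, int width, int height,
                              uint8_t *rgb, size_t bytesPerRow);

/* Very small box-average downscaler for tightly packed RGB888 data. */
static uint8_t *downscaleRGB(const uint8_t *src, int srcW, int srcH, int dstW, int dstH) {
	uint8_t *dst = malloc((size_t)dstW * dstH * 3);
	for(int dy = 0; dy < dstH; dy++) {
		int y0 = dy * srcH / dstH, y1 = (dy + 1) * srcH / dstH;
		for(int dx = 0; dx < dstW; dx++) {
			int x0 = dx * srcW / dstW, x1 = (dx + 1) * srcW / dstW;
			for(int c = 0; c < 3; c++) {
				long sum = 0;
				for(int y = y0; y < y1; y++)
					for(int x = x0; x < x1; x++)
						sum += src[(y * srcW + x) * 3 + c];
				dst[(dy * dstW + dx) * 3 + c] = (uint8_t)(sum / ((y1 - y0) * (x1 - x0)));
			}
		}
	}
	return dst;
}

/* Hash a 32x32 thumbnail instead of the full-resolution image. */
static const char *hashDownscaled(int xComponents, int yComponents,
                                  const uint8_t *rgb, int width, int height) {
	uint8_t *small = downscaleRGB(rgb, width, height, 32, 32);
	const char *hash = blurHashForPixels(xComponents, yComponents, 32, 32, small, 32 * 3);
	free(small);
	return hash;
}
```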
> However, this code is written intentionally to be simple rather than performant
Does this mean you’re rejecting any performance improvements entirely, or only the more radical ones (like 4× loop unrolling)?
Regarding the suggestion to scale the image down to 32×32 — that almost eliminates any benefit from sRGB → linear conversion.
Performance improvements are still measurable even at that size. I used large images only to better demonstrate the effect; the same applies to small ones.