
Speedup C encoder up to 100x

Open homm opened this issue 1 year ago • 5 comments

All changes are split into independent commits; some of them are optional.

In addition to improving performance, there are these changes:

  • Do not define M_PI in the sources; ensure it is defined via math.h.
  • Fixed the maximum number of components for the blurhash_encoder executable (in line with the blurHashForPixels function).
  • Improved the Makefile to avoid heavy recompilation of encode_stb on every change.

Benchmarks are in the comments below.

homm avatar Sep 25 '24 08:09 homm

~~I've also implemented SSE and NEON optimizations in a separate branch.~~ The last optimization, unrolling the loop in multiplyBasisFunction, actually works better, since it allows any compiler to effectively auto-vectorize the code. Here are benchmarks for a 2000 × 1334 JPEG image on different systems:

Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz

| Optimization | GCC 13.2.1, 6×4 | GCC 13.2.1, 9×9 | Clang 17.0.6, 6×4 | Clang 17.0.6, 9×9 |
|---|---|---|---|---|
| Master | 3181 ms | 11844 ms | 3154 ms | 11124 ms |
| sRGBToLinear_cache | 381 | 1507 | 451 | 1633 |
| cosX cache | 82 | 339 | 88 | 270 |
| Single pass | 58 | 177 | 62 | 207 |
| ~~SSE~~ (obsolete) | 39 | 114 | 42 | 144 |
| Unroll 4x | 30 | 80 | 32 | 85 |

Apple M1 Pro

| Optimization | GCC 13.2.1, 6×4 | GCC 13.2.1, 9×9 | Clang 17.0.6, 6×4 | Clang 17.0.6, 9×9 | Clang 14.0.3, 6×4 | Clang 14.0.3, 9×9 |
|---|---|---|---|---|---|---|
| Master | 1177 ms | 4076 ms | 1156 ms | 4005 ms | 1268 ms | 4302 ms |
| sRGBToLinear_cache | 212 | 826 | 216 | 839 | 186 | 653 |
| cosX cache | 44 | 150 | 80 | 271 | 81 | 271 |
| Single pass | 20 | 62 | 32 | 57 | 29 | 70 |
| ~~NEON~~ (obsolete) | 27 | 87 | 25 | 80 | 25 | 80 |
| Unroll 4x | 16 | 49 | 15 | 43 | 15 | 42 |

* Results for M1 Pro were corrected, since the previous results were affected by a bug.

homm avatar Oct 03 '24 20:10 homm

@DagAgren Are you interested in these improvements?

homm avatar Oct 11 '24 12:10 homm

I also improved decoder performance by about 14× using the same techniques: caching cos values, caching linearTosRGB values, and unrolling loops. This improves decoding performance from 6 Mpx/s to 86 Mpx/s on M1.

Before:

```
$ touch decode.c && make blurhash_decoder && ./blurhash_decoder "W7E-z7oyM{8xM{wKwdMepHrE%LV[OVV@BBS\$r@NaR7OrRQNaMKXm" 640 480 _out.png
Time per 30 execution: 49.532 ms
```

After:

```
$ touch decode.c && make blurhash_decoder && ./blurhash_decoder "W7E-z7oyM{8xM{wKwdMepHrE%LV[OVV@BBS\$r@NaR7OrRQNaMKXm" 640 480 _out.png
Time per 30 execution: 3.573 ms
```

This also introduces a very minor change in the output: nothing that could be noticed by the human eye, just a different binary result.

The method I use to measure performance is the following:

```diff
diff --git forkSrcPrefix/C/encode_stb.c forkDstPrefix/C/encode_stb.c
index 811ca00006b45eaa829bfd267904ac0d0c647884..a95c6a2ff96ee7cdaa9d1b35ef28b063161cf01d 100644
--- forkSrcPrefix/C/encode_stb.c
+++ forkDstPrefix/C/encode_stb.c
@@ -4,6 +4,7 @@
 #include "stb_image.h"
 
 #include <stdio.h>
+#include <time.h>
 
 const char *blurHashForFile(int xComponents, int yComponents,const char *filename);
 
@@ -38,6 +39,14 @@ const char *blurHashForFile(int xComponents, int yComponents,const char *filenam
 
 	const char *hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
 
+	#define TIMES 30
+	clock_t start = clock();
+    for (int i = 0; i < TIMES; i++) {
+        hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
+    }
+    double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+    printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
 	stbi_image_free(data);
 
 	return hash;
diff --git forkSrcPrefix/C/decode_stb.c forkDstPrefix/C/decode_stb.c
index dab164e1eaf1a7199a751a5e13f6da7099027bd2..3514f53e6f91dc41253429ea07e594893d536598 100644
--- forkSrcPrefix/C/decode_stb.c
+++ forkDstPrefix/C/decode_stb.c
@@ -3,6 +3,8 @@
 #define STB_IMAGE_WRITE_IMPLEMENTATION
 #include "stb_writer.h"
 
+#include <time.h>
+
 int main(int argc, char **argv) {
 	if(argc < 5) {
 		fprintf(stderr, "Usage: %s hash width height output_file [punch]\n", argv[0]);
@@ -34,6 +36,15 @@ int main(int argc, char **argv) {
 
 	freePixelArray(bytes);
 
+	#define TIMES 30
+	clock_t start = clock();
+    for (int i = 0; i < TIMES; i++) {
+    	uint8_t * tmpbytes = decode(hash, width, height, punch, nChannels);
+    	freePixelArray(tmpbytes);
+    }
+    double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+    printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
 	fprintf(stdout, "Decoded blurhash successfully, wrote PNG file %s\n", output_file);
 	return 0;
 }
```

homm avatar Oct 24 '24 09:10 homm

@DagAgren How can I earn your attention?

homm avatar Oct 30 '24 18:10 homm

@DagAgren please note that we would be very grateful for the optimization of the algorithm.

vellnes avatar Dec 04 '24 15:12 vellnes

This is a breakthrough for this library. Why can't we merge it? @DagAgren ?

jonybekov avatar Oct 28 '25 22:10 jonybekov

Sorry I did not see this earlier. However, this code is intentionally written to be simple rather than performant, because it is meant as a reference implementation that can be ported as easily as possible to other platforms.

Also, it should not need high performance. You should not run it on a full-sized image; instead, first scale the image down to a much smaller size, such as 32×32, and run it on that. This is mentioned in the documentation. Running it on a full-size image is not useful, as it throws away all that detail anyway.

DagAgren avatar Oct 29 '25 16:10 DagAgren

> However, this code is written intentionally to be simple rather than performant

Does this mean you’re rejecting any performance improvements entirely, or only the more radical ones (like 4× loop unrolling)?

Regarding the suggestion to scale the image down to 32×32: that almost eliminates any benefit from the sRGB → linear conversion.

Performance improvements are still measurable even at that size. I used large images only to better demonstrate the effect; the same applies to small ones.

homm avatar Oct 30 '25 09:10 homm