Speed up the C encoder by up to 100x
All changes are split into independent commits; some of them are optional.
In addition to the performance improvements, there are a few other changes:
- Do not define `M_PI` in the sources; ensure it is defined in `math.h` (see the sketch after this list).
- Fixed the maximum number of components for the `blurhash_encoder` executable (in line with the `blurHashForPixels` function).
- Improved the `Makefile` to avoid heavy `encode_stb` recompilation on each change.
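The `M_PI` change is just a matter of asking `math.h` for the constant rather than redefining it. A minimal sketch of one common way to do that (the actual commit may instead pass a feature-test flag from the `Makefile`):

```c
/* Sketch only: request M_PI from <math.h> instead of redefining it in our sources.
   MSVC, and some strict ISO-C modes on other platforms, only expose M_PI when
   _USE_MATH_DEFINES (or an equivalent feature-test macro) is defined first. */
#ifndef _USE_MATH_DEFINES
#define _USE_MATH_DEFINES
#endif
#include <math.h>
```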
Benchmarks are in the comments below.
~~I've also implemented SSE and NEON optimizations in a separate branch.~~ The last optimization, unrolling the loop in multiplyBasisFunction, actually works better, since it lets any compiler autovectorize the code effectively (a sketch of the unrolling idea follows the tables below). Here are benchmarks for a 2000 × 1334 JPEG image on different systems; the column pairs are times in milliseconds for 6×4 and 9×9 components:
Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
| Optimization | GCC 13.2.1, 6×4 | GCC 13.2.1, 9×9 | Clang 17.0.6, 6×4 | Clang 17.0.6, 9×9 |
|---|---|---|---|---|
| Master | 3181 ms | 11844 ms | 3154 ms | 11124 ms |
| sRGBToLinear_cache | 381 | 1507 | 451 | 1633 |
| cosX cache | 82 | 339 | 88 | 270 |
| Single pass | 58 | 177 | 62 | 207 |
| ~~SSE~~ (obsolete) | 39 | 114 | 42 | 144 |
| Unroll 4x | 30 | 80 | 32 | 85 |
Apple M1 Pro
| Optimization | GCC 13.2.1, 6×4 | GCC 13.2.1, 9×9 | Clang 17.0.6, 6×4 | Clang 17.0.6, 9×9 | Clang 14.0.3, 6×4 | Clang 14.0.3, 9×9 |
|---|---|---|---|---|---|---|
| Master | 1177 ms | 4076 ms | 1156 ms | 4005 ms | 1268 ms | 4302 ms |
| sRGBToLinear_cache | 212 | 826 | 216 | 839 | 186 | 653 |
| cosX cache | 44 | 150 | 80 | 271 | 81 | 271 |
| Single pass | 20 | 62 | 32 | 57 | 29 | 70 |
| ~~NEON~~ (obsolete) | 27 | 87 | 25 | 80 | 25 | 80 |
| Unroll 4x | 16 | 49 | 15 | 43 | 15 | 42 |
* The M1 Pro results were corrected; the previously posted numbers were affected by a bug.
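To make the table rows concrete, here is a minimal sketch of how the `sRGBToLinear_cache`, `cosX cache` and `Unroll 4x` steps fit together. The names and structure below are illustrative only (the real patch also folds everything into a single pass and handles all three channels and all components at once); it computes the unnormalised red-channel sum for one basis function, using the `blurHashForPixels` RGB888 layout:

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/* 256-entry table replacing the per-pixel powf() call inside sRGBToLinear(). */
static float sRGBToLinear_cache[256];

static void initSRGBToLinearCache(void) {
	for(int i = 0; i < 256; i++) {
		float v = i / 255.0f;
		sRGBToLinear_cache[i] = v <= 0.04045f ? v / 12.92f : powf((v + 0.055f) / 1.055f, 2.4f);
	}
}

/* Unnormalised red-channel sum for one (xComponent, yComponent) basis function.
   cosX is computed once per component instead of once per pixel, and the inner
   loop keeps four independent accumulators so the compiler can autovectorize it. */
static float basisFactorRed(int xComponent, int yComponent, int width, int height,
                            const uint8_t *rgb, size_t bytesPerRow) {
	float *cosX = malloc(sizeof(float) * width);
	for(int x = 0; x < width; x++)
		cosX[x] = cosf((float)M_PI * xComponent * x / width);

	float sum = 0;
	for(int y = 0; y < height; y++) {
		float cosY = cosf((float)M_PI * yComponent * y / height);
		const uint8_t *row = rgb + y * bytesPerRow;
		float acc[4] = {0, 0, 0, 0};
		int x = 0;
		for(; x + 4 <= width; x += 4)           /* "Unroll 4x" */
			for(int k = 0; k < 4; k++)
				acc[k] += cosX[x + k] * sRGBToLinear_cache[row[3 * (x + k)]];
		for(; x < width; x++)                    /* leftover pixels */
			acc[0] += cosX[x] * sRGBToLinear_cache[row[3 * x]];
		sum += cosY * (acc[0] + acc[1] + acc[2] + acc[3]);
	}
	free(cosX);
	return sum;
}
```

With the cached tables in place, the unrolled loop is plain scalar C; the speedup comes from the compiler vectorizing the four independent accumulators, which is the effect the obsolete hand-written SSE/NEON branches tried to achieve manually.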
@DagAgren Are you interested in these improvements?
I also improved decoder performance about 14× using the same techniques: caching cos values and linearTosRGB values, and unrolling loops. This raises decoding throughput from 6 Mpx/s to 86 Mpx/s on M1.
Before (master):
$ touch decode.c && make blurhash_decoder && ./blurhash_decoder "W7E-z7oyM{8xM{wKwdMepHrE%LV[OVV@BBS\$r@NaR7OrRQNaMKXm" 640 480 _out.png
Time per 30 execution: 49.532 ms
After (optimized):
$ touch decode.c && make blurhash_decoder && ./blurhash_decoder "W7E-z7oyM{8xM{wKwdMepHrE%LV[OVV@BBS\$r@NaR7OrRQNaMKXm" 640 480 _out.png
Time per 30 execution: 3.573 ms
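(For scale: 640 × 480 = 307,200 pixels, so 49.532 ms per decode is roughly 6.2 Mpx/s and 3.573 ms is roughly 86 Mpx/s, which is where the numbers above come from.)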
This also introduces a very minor change in the output: nothing that could be noticed by the human eye, just slightly different binary output.
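For context, one natural source of such a sub-perceptual difference is replacing the exact `linearTosRGB` computation with a quantized lookup table on the decode path. A minimal sketch of that idea, assuming a 4096-entry table (names and table size here are illustrative, not the exact patch):

```c
#include <math.h>
#include <stdint.h>

#define LINEAR_TO_SRGB_STEPS 4096

static uint8_t linearTosRGB_cache[LINEAR_TO_SRGB_STEPS];

static void initLinearTosRGBCache(void) {
	for(int i = 0; i < LINEAR_TO_SRGB_STEPS; i++) {
		float v = (float)i / (LINEAR_TO_SRGB_STEPS - 1);
		float s = v <= 0.0031308f ? v * 12.92f : 1.055f * powf(v, 1.0f / 2.4f) - 0.055f;
		linearTosRGB_cache[i] = (uint8_t)(s * 255.0f + 0.5f);
	}
}

/* Quantizing the linear value to one of 4096 steps is what can flip the odd
   output byte by one compared to calling the exact formula per pixel. */
static uint8_t linearTosRGBFast(float value) {
	if(value <= 0) return 0;
	if(value >= 1) return 255;
	return linearTosRGB_cache[(int)(value * (LINEAR_TO_SRGB_STEPS - 1) + 0.5f)];
}
```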
The method I use to measure performance is the following:
diff --git forkSrcPrefix/C/encode_stb.c forkDstPrefix/C/encode_stb.c
index 811ca00006b45eaa829bfd267904ac0d0c647884..a95c6a2ff96ee7cdaa9d1b35ef28b063161cf01d 100644
--- forkSrcPrefix/C/encode_stb.c
+++ forkDstPrefix/C/encode_stb.c
@@ -4,6 +4,7 @@
#include "stb_image.h"
#include <stdio.h>
+#include <time.h>
const char *blurHashForFile(int xComponents, int yComponents,const char *filename);
@@ -38,6 +39,14 @@ const char *blurHashForFile(int xComponents, int yComponents,const char *filenam
const char *hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
+ #define TIMES 30
+ clock_t start = clock();
+ for (int i = 0; i < TIMES; i++) {
+ hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
+ }
+ double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+ printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
stbi_image_free(data);
return hash;
diff --git forkSrcPrefix/C/decode_stb.c forkDstPrefix/C/decode_stb.c
index dab164e1eaf1a7199a751a5e13f6da7099027bd2..3514f53e6f91dc41253429ea07e594893d536598 100644
--- forkSrcPrefix/C/decode_stb.c
+++ forkDstPrefix/C/decode_stb.c
@@ -3,6 +3,8 @@
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_writer.h"
+#include <time.h>
+
int main(int argc, char **argv) {
if(argc < 5) {
fprintf(stderr, "Usage: %s hash width height output_file [punch]\n", argv[0]);
@@ -34,6 +36,15 @@ int main(int argc, char **argv) {
freePixelArray(bytes);
+ #define TIMES 30
+ clock_t start = clock();
+ for (int i = 0; i < TIMES; i++) {
+ uint8_t * tmpbytes = decode(hash, width, height, punch, nChannels);
+ freePixelArray(tmpbytes);
+ }
+ double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+ printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
fprintf(stdout, "Decoded blurhash successfully, wrote PNG file %s\n", output_file);
return 0;
}
@DagAgren How can I earn your attention?
@DagAgren Please note that we would be very grateful for the optimization of the algorithm.
This is a breakthrough for this library. Why can't we merge it, @DagAgren?
Sorry, I did not see this earlier. However, this code is intentionally written to be simple rather than performant, because it is meant as a reference implementation that can be ported as easily as possible to other platforms.
Also, it should not need high performance. You should not run it on a full-sized image, but instead first scale the image down to a much smaller size, such as 32x32, and run it on that. This is mentioned in the documentation. Running it on a full-scale image is not useful, as it throws away all that detail anyway.
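For reference, the recommended workflow is easy to sketch: downscale first, then hash the small buffer. The box downscaler below is only an illustration (any real resizer works just as well), and it assumes tightly packed RGB888 input at least 32 × 32 pixels in size; only the `blurHashForPixels` prototype comes from the library itself.

```c
#include <stdint.h>
#include <stdlib.h>

const char *blurHashForPixels(int xComponents, int yComponents, int width, int height,
                              uint8_t *rgb, size_t bytesPerRow);

/* Very small box-average downscaler for tightly packed RGB888 data. */
static uint8_t *downscaleRGB(const uint8_t *src, int srcW, int srcH, int dstW, int dstH) {
	uint8_t *dst = malloc((size_t)dstW * dstH * 3);
	for(int dy = 0; dy < dstH; dy++) {
		int y0 = dy * srcH / dstH, y1 = (dy + 1) * srcH / dstH;
		for(int dx = 0; dx < dstW; dx++) {
			int x0 = dx * srcW / dstW, x1 = (dx + 1) * srcW / dstW;
			for(int c = 0; c < 3; c++) {
				long sum = 0;
				for(int y = y0; y < y1; y++)
					for(int x = x0; x < x1; x++)
						sum += src[(y * srcW + x) * 3 + c];
				dst[(dy * dstW + dx) * 3 + c] = (uint8_t)(sum / ((y1 - y0) * (x1 - x0)));
			}
		}
	}
	return dst;
}

/* Hash a 32x32 thumbnail instead of the full-resolution image. */
static const char *hashDownscaled(int xComponents, int yComponents,
                                  const uint8_t *rgb, int width, int height) {
	uint8_t *small = downscaleRGB(rgb, width, height, 32, 32);
	const char *hash = blurHashForPixels(xComponents, yComponents, 32, 32, small, 32 * 3);
	free(small);
	return hash;
}
```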
> However, this code is written intentionally to be simple rather than performant
Does this mean you’re rejecting any performance improvements entirely, or only the more radical ones (like 4× loop unrolling)?
Regarding the suggestion to scale the image down to 32×32 — that almost eliminates any benefit from sRGB → linear conversion.
Performance improvements are still measurable even at that size. I used large images only to better demonstrate the effect; the same applies to small ones.