ssimulacra2
Be faster
The blur in particular is still rather slow despite our many efforts to optimize it.
The way I see it, there are two main ways to improve the blurring speed: either choose a different algorithm or optimize the current one further.
The original algo for the Gaussian blur came from libjxl. Maybe they ran tests and already have a justification for why they use that particular implementation. I couldn't find it quickly, but it probably exists somewhere.
With regard to optimizing the algo further, the focus should probably be on memory accesses rather than CPU time. That is going to be pretty hard IMO, since the compiler is already doing a good job. I don't think there are any easy big gains left here, at least for x86.
I had some free time recently and really looked at it again, and I noticed something: All[^1] of the SIMD (AVX, to be specific) instructions are the ss (scalar single-precision) variant, and not the ps (packed single-precision) one. This means that each xmm register effectively holds ONE float, and we only go through those registers to get at the FMA instructions.
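To make the ss/ps difference concrete, here is a rough sketch in Rust. It is not code from the crate: `fma_scalar`, `fma_packed`, and `LANES` are made-up names, and the 8-lane width assumes 256-bit ymm registers. Both functions compute the same values. The second one only makes the 8-wide grouping explicit, which gives the compiler an easy shape to lower to a single packed FMA per chunk. Whether it actually emits the ps form depends on the target features (FMA has to be enabled, for example with `-C target-cpu=native`) and needs checking in the disassembly.

```rust
/// Assumed lane count: 8 f32 values fit in one 256-bit (ymm) register.
const LANES: usize = 8;

/// Scalar shape: one fused multiply-add per element.
/// This is the per-float work that an `ss` instruction does.
fn fma_scalar(a: &[f32], b: &[f32], acc: &mut [f32]) {
    assert!(a.len() == acc.len() && b.len() == acc.len());
    for ((acc, &a), &b) in acc.iter_mut().zip(a).zip(b) {
        // Without the FMA target feature, mul_add may fall back to a slow
        // library call instead of a single instruction.
        *acc = a.mul_add(b, *acc);
    }
}

/// Packed shape: the same arithmetic, grouped into fixed-size chunks of
/// LANES independent elements, followed by a scalar tail.
fn fma_packed(a: &[f32], b: &[f32], acc: &mut [f32]) {
    assert!(a.len() == acc.len() && b.len() == acc.len());
    let split = acc.len() - acc.len() % LANES;
    let (acc_main, acc_tail) = acc.split_at_mut(split);

    let chunks = acc_main
        .chunks_exact_mut(LANES)
        .zip(a[..split].chunks_exact(LANES))
        .zip(b[..split].chunks_exact(LANES));
    for ((acc_c, a_c), b_c) in chunks {
        // Fixed trip count and no dependency between lanes.
        for i in 0..LANES {
            acc_c[i] = a_c[i].mul_add(b_c[i], acc_c[i]);
        }
    }

    // Leftover elements that do not fill a whole chunk.
    fma_scalar(&a[split..], &b[split..], acc_tail);
}
```

The catch is that this only works when the lanes are independent of each other, which is exactly what the blur's access pattern makes hard to arrange.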
We should be able to get at least the horizontal pass to use packed instructions. Though I'm sure it'd be fairly difficult, because of the way the algorithm accesses memory (bounds "checking", 2 accesses per pixel). Maybe libjxl can serve as an inspiration on how to do it.
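As a sketch of what getting the bounds "checking" out of the way could look like, here is a simplified Rust example. It is an assumption-heavy stand-in, not the real blur: `tap_sum_checked`, `tap_sum_split`, and `radius` are hypothetical names, the two taps are placed symmetrically at `i - radius` and `i + radius`, out-of-range samples count as zero, and everything else the filter does is left out. The idea is to handle the left and right borders separately so that the interior loop has no per-pixel branches.

```rust
/// Reference version: two reads per pixel, each guarded by a range check.
/// Out-of-range taps contribute zero.
fn tap_sum_checked(row: &[f32], radius: usize, out: &mut [f32]) {
    let w = row.len();
    assert_eq!(out.len(), w);
    for i in 0..w {
        let left = if i >= radius { row[i - radius] } else { 0.0 };
        let right = if i + radius < w { row[i + radius] } else { 0.0 };
        out[i] = left + right;
    }
}

/// Same result, but the row is split into left border, interior, and
/// right border so the interior loop needs no per-pixel checks.
fn tap_sum_split(row: &[f32], radius: usize, out: &mut [f32]) {
    let w = row.len();
    assert_eq!(out.len(), w);
    if w <= 2 * radius {
        // Too short to have an interior region; use the checked version.
        tap_sum_checked(row, radius, out);
        return;
    }

    // Left border: only the right tap is in range.
    for i in 0..radius {
        out[i] = row[i + radius];
    }

    // Interior: both taps are in range, expressed as two shifted views of
    // the row that are added element by element.
    let left = &row[..w - 2 * radius];
    let right = &row[2 * radius..];
    for ((o, &l), &r) in out[radius..w - radius].iter_mut().zip(left).zip(right) {
        *o = l + r;
    }

    // Right border: only the left tap is in range.
    for i in (w - radius)..w {
        out[i] = row[i - radius];
    }
}
```

The interior loop is the part a compiler could plausibly turn into packed loads and adds. Whether the same split carries over to the actual horizontal pass is an open question.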
[^1]: with some relatively unimportant exceptions like mov and xor