Add an implementation of Stack Blur as a `juce::ImageEffectFilter`

Open ImJimmi opened this issue 3 years ago • 3 comments

Adds an implementation of the Stack Blur algorithm as described here: https://observablehq.com/@jobleonard/mario-klingemans-stackblur
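For readers who don't know the algorithm, here is a minimal sketch of one Stack Blur pass over a single row of single-channel pixels. It is not the code from this PR: the function name `stackBlurRow`, the use of `std::vector`, and the choice to repeat edge pixels are all assumptions made for illustration. The key idea is that the triangular-weighted sum is updated incrementally from two running totals, so the cost per pixel does not grow with the radius.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Blurs one row of single-channel pixels with the triangular "stack" kernel.
// Pixels beyond either end of the row are treated as copies of the edge pixel.
// (Hypothetical helper for illustration, not part of JUCE.)
std::vector<std::uint8_t> stackBlurRow (const std::vector<std::uint8_t>& src, int radius)
{
    const int width = static_cast<int> (src.size());

    if (width == 0 || radius < 1)
        return src;

    const int stackSize = 2 * radius + 1;
    const int divisor = (radius + 1) * (radius + 1); // sum of all kernel weights

    std::vector<int> stack (static_cast<std::size_t> (stackSize));
    std::vector<std::uint8_t> dst (src.size());

    int sum = 0;    // weighted sum of the pixels currently in the stack
    int sumIn = 0;  // plain sum of the pixels to the right of the centre
    int sumOut = 0; // plain sum of the centre pixel and those to its left

    // Fill the stack for x = 0.
    for (int i = -radius; i <= radius; ++i)
    {
        const int pixel = src[static_cast<std::size_t> (std::min (std::max (i, 0), width - 1))];
        stack[static_cast<std::size_t> (i + radius)] = pixel;
        sum += pixel * (radius + 1 - std::abs (i));
        (i > 0 ? sumIn : sumOut) += pixel;
    }

    int centre = radius; // index in the circular stack of the current centre pixel

    for (int x = 0; x < width; ++x)
    {
        dst[static_cast<std::size_t> (x)] = static_cast<std::uint8_t> (sum / divisor);

        // Every pixel on the outgoing side loses one unit of weight.
        sum -= sumOut;

        // Replace the oldest pixel in the circular stack with the incoming one.
        const auto oldest = static_cast<std::size_t> ((centre - radius + stackSize) % stackSize);
        sumOut -= stack[oldest];

        const int incoming = src[static_cast<std::size_t> (std::min (x + radius + 1, width - 1))];
        stack[oldest] = incoming;
        sumIn += incoming;

        // Every pixel on the incoming side gains one unit of weight.
        sum += sumIn;

        // Advance the centre: the new centre moves from the "in" side to the "out" side.
        centre = (centre + 1) % stackSize;
        sumOut += stack[static_cast<std::size_t> (centre)];
        sumIn -= stack[static_cast<std::size_t> (centre)];
    }

    return dst;
}
```

A full image blur would run a pass like this over every row and then over every column, once per colour channel.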

Stack Blur is a significantly faster blurring algorithm than the existing Gaussian blur, especially at higher blur radii. Here's a graph showing the render time in milliseconds for increasing blur radii for Gaussian blur, multi-threaded Stack Blur, and single-threaded Stack Blur:

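[Graph: render time in milliseconds (logarithmic Y-axis) against blur radius for Gaussian blur, multi-threaded Stack Blur, and single-threaded Stack Blur]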

Note that the Y-axis is logarithmic. Even without a thread pool, Stack Blur is around 7x faster than Gaussian blur at a blur radius of 25px. With a thread pool, Stack Blur is around 38x faster than Gaussian blur.

To look at it another way, the maximum framerate you'd get from the Gaussian blur at 25 blur radius would be ~1.3FPS. With Stack Blur using a thread pool also with 25 blur radius you could achieve ~48FPS.
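As a quick sanity check on those figures (using only the rounded numbers quoted here, so the results are approximate), the render time per frame $t$ follows from the frame rate $f$:

$$t = \frac{1000}{f}\ \text{ms}, \qquad t_{\text{Gaussian}} \approx \frac{1000}{1.3} \approx 769\ \text{ms}, \qquad t_{\text{Stack}} \approx \frac{1000}{48} \approx 20.8\ \text{ms}$$

$$\frac{t_{\text{Gaussian}}}{t_{\text{Stack}}} \approx \frac{769}{20.8} \approx 37$$

That ratio is in line with the ~38x speed-up quoted for the thread-pool version, allowing for rounding.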


Stack Blur (top) also gives a much "smoother" blur than Gaussian (bottom), which tends to 'smudge' elements, especially noticeable at the edges of images:

[Image: comparison of Stack Blur (top) and Gaussian blur (bottom)]


This PR also adds a new Blur Demo to the Demo Runner example project. The demo shows the differences between the two available blur techniques in JUCE and their respective render times, with a slider to adjust the blur radius.

[Screenshot: the new Blur Demo running in the Demo Runner]


This is an extension of a previous PR made here: https://github.com/juce-framework/JUCE/pull/934. However, this version is written from scratch in a more JUCEy way.

This work was initially inspired by this thread on the forums: https://forum.juce.com/t/faster-blur-glassmorphism-ui/43086

ImJimmi avatar Apr 04 '22 15:04 ImJimmi

@ImJimmi I'm pretty excited by the work you've done here. I have my own implementation of stack blur, but I never did the profiling.

> With Stack Blur using a thread pool also with 25 blur radius you could achieve ~48FPS.

I'm wondering what size image is this on? It would be nice to know if blurring X amount of pixels is possible at > 60fps...

Also, did you run into any further optimization ideas? Web browsers do this very efficiently, but they are GPU based.

Ran into this cool thing too tho: https://developer.chrome.com/blog/animated-blur/

sudara avatar Jun 09 '22 21:06 sudara

@sudara

> I'm wondering what size image is this on? It would be nice to know if blurring X amount of pixels is possible at > 60fps...

Those benchmarks were with the new demo I added to the DemoRunner using whatever size the component is - I think about 640x480 based on the screenshot above.

60fps would be easily achievable on a smaller component, say a drop-down menu. But even with the best blurring algorithm, blurring a large image on a CPU is not going to be particularly performant.

> Also, did you run into any further optimization ideas? Web browsers do this very efficiently, but they are GPU based.

The two biggest performance gains (other than the Stack Blur algorithm itself) came from:

  • Using a thread-pool which allows for multiple chunks of the image to be processed in parallel - I've @aceaudio to thank for that idea!
  • ~Keeping dynamic allocations to a minimum. Initially I allocated only the minimum amount of memory required, using a juce::Array; however, that meant a heap allocation for every row and column of the image. By limiting the blur radius and then always allocating the maximum amount of memory required on the stack, you get much better performance.~ Oh, looks like I changed that again in the final version... that might be something to look at again.
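To illustrate the first point, here is a minimal sketch of splitting the rows of an image into contiguous chunks and processing each chunk on its own thread. It uses plain `std::thread` rather than `juce::ThreadPool`, and the name `forEachRowInParallel` is hypothetical, so treat it as an outline of the idea rather than the PR's implementation.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Calls processRow (y) once for every y in [0, numRows), splitting the rows
// into contiguous chunks with one worker thread per chunk.
// (Hypothetical helper; a real implementation would reuse pooled threads.)
void forEachRowInParallel (int numRows, int numThreads, const std::function<void (int)>& processRow)
{
    numThreads = std::max (1, std::min (numThreads, numRows));
    const int rowsPerChunk = (numRows + numThreads - 1) / numThreads; // rounded up

    std::vector<std::thread> workers;

    for (int start = 0; start < numRows; start += rowsPerChunk)
    {
        const int end = std::min (start + rowsPerChunk, numRows);

        workers.emplace_back ([start, end, &processRow]
        {
            for (int y = start; y < end; ++y)
                processRow (y);
        });
    }

    // Wait for every chunk to finish before returning.
    for (auto& worker : workers)
        worker.join();
}
```

This works because each row can be blurred independently of the others in the horizontal pass. The vertical pass reads the output of the horizontal pass, so it can only start once all of the row chunks have finished, which is why the function joins every worker before returning.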

The only other thing I did look at was trying to align the data so you're always accessing it in order. Reading from a container by iterating one index at a time is much quicker than stepping over large chunks. However, that's exactly what you have to do to process each column of the image: you read one pixel and then jump NUM_PIXELS_WIDE forward in memory to get the pixel below it. So I tried to align the memory beforehand so you're always reading one pixel after the next, but that was actually slower in the end, since the overhead of aligning the data was more than the inefficiency of reading the non-aligned data.
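The two access patterns described above can be sketched as follows, assuming a row-major, single-channel image. The `GreyImage` struct and both function names are hypothetical and only meant to show the memory layout: reading a column means striding `width` elements between reads, whereas a transposed copy makes each original column contiguous at the cost of an extra pass over the whole image.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major, single-channel image: pixel (x, y) lives at index y * width + x.
struct GreyImage
{
    int width;
    int height;
    std::vector<std::uint8_t> pixels;
};

// Reads column x directly. Consecutive reads are `width` elements apart.
std::vector<std::uint8_t> readColumn (const GreyImage& image, int x)
{
    std::vector<std::uint8_t> column;
    column.reserve (static_cast<std::size_t> (image.height));

    for (int y = 0; y < image.height; ++y)
        column.push_back (image.pixels[static_cast<std::size_t> (y * image.width + x)]);

    return column;
}

// Returns a transposed copy, so that each original column becomes a
// contiguous row. The copy itself touches every pixel once.
GreyImage transposed (const GreyImage& image)
{
    GreyImage result { image.height, image.width,
                       std::vector<std::uint8_t> (image.pixels.size()) };

    for (int y = 0; y < image.height; ++y)
        for (int x = 0; x < image.width; ++x)
            result.pixels[static_cast<std::size_t> (x * image.height + y)]
                = image.pixels[static_cast<std::size_t> (y * image.width + x)];

    return result;
}
```

Whether the transposed copy pays for itself depends on the image size and the hardware; in the experiment described above, it did not.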

ImJimmi avatar Jun 15 '22 11:06 ImJimmi

> Those benchmarks were with the new demo I added to the DemoRunner using whatever size the component is - I think about 640x480 based on the screenshot above.

Ok, great! Just wanted to confirm that the benchmarks were from that!

> Using a thread-pool which allows for multiple chunks of the image to be processed in parallel

This is pretty interesting, wouldn't have thought of it!

> The only other thing I did look at was trying to align the data so you're always accessing it in order.

Yeah, I was also wondering if there was an opportunity for matrix multiplication here. But without an ability to run that on a gpu... maybe no gainz possible.

The only other thing that I've noticed is impactful on my end is providing a tie-in to JUCE's components.

For example, I use stack blur for drop shadows and there are all sorts of optimization techniques, like having the shadow be in its own container, cached by `setBufferedToImage` etc. To some degree, this sort of optimization could be under the hood (avoid re-calculating the same blur over and over during painting).
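A minimal sketch of that "under the hood" idea, in plain C++ rather than JUCE types: a small wrapper that keeps the last blurred result and only re-runs the blur when the source pixels or the radius change. The `CachedBlur` class and the `Pixels` and `BlurFunction` aliases are hypothetical names for illustration. Comparing the whole source on every call is a simplification; a real component would more likely invalidate the cache explicitly when its content changes.

```cpp
#include <functional>
#include <utility>
#include <vector>

using Pixels = std::vector<unsigned char>;
using BlurFunction = std::function<Pixels (const Pixels&, int)>;

// Remembers the last blurred result and only re-runs the blur when the
// source pixels or the radius differ from the previous call.
class CachedBlur
{
public:
    explicit CachedBlur (BlurFunction blurToUse) : blur (std::move (blurToUse)) {}

    const Pixels& get (const Pixels& source, int radius)
    {
        if (! valid || radius != cachedRadius || source != cachedSource)
        {
            cachedResult = blur (source, radius);
            cachedSource = source;
            cachedRadius = radius;
            valid = true;
        }

        return cachedResult;
    }

private:
    BlurFunction blur;
    Pixels cachedSource, cachedResult;
    int cachedRadius = 0;
    bool valid = false;
};
```

A paint routine could then call `get()` on every repaint and only pay for the blur when something has actually changed.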

sudara avatar Jun 15 '22 16:06 sudara