imageproc
How best to add parallelism?
Lots of the functions in this library are embarrassingly parallel. How should we parallelise them? Ideally with as little impact on the function bodies as possible. Image-chunk-iterators plus https://github.com/nikomatsakis/rayon?
On the get side, for most functions we could simply change `for y in 0..height` into a rayon equivalent.
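As a rough sketch of that idea (not from the thread): the example below parallelises a per-row loop using only the standard library's `std::thread::scope`, so it runs without any dependencies; with rayon, the same closure body would hang off `(0..height).into_par_iter()` instead of scoped spawns. The `invert_rows` function and the flat `Vec<u8>` "image" are made up for illustration, not part of imageproc's API.

```rust
// Sketch: parallelising a read-only "for y in 0..height" loop over rows.
// Each thread gets a disjoint &mut row of the output via chunks_mut, so
// no locking is needed. With rayon this would be par_chunks_mut instead.
fn invert_rows(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    let mut out = vec![0u8; width * height];
    std::thread::scope(|s| {
        for (y, row) in out.chunks_mut(width).enumerate() {
            let src_row = &src[y * width..(y + 1) * width];
            s.spawn(move || {
                for (dst, &px) in row.iter_mut().zip(src_row) {
                    *dst = 255 - px; // per-pixel work goes here
                }
            });
        }
    });
    let _ = height; // height is implied by src.len() / width here
    out
}

fn main() {
    let src = vec![10u8, 20, 30, 40, 50, 60];
    let out = invert_rows(&src, 3, 2);
    assert_eq!(out, vec![245, 235, 225, 215, 205, 195]);
}
```

Spawning one OS thread per row is obviously not what you'd ship (a real version would hand fixed-size chunks to a thread pool), but it shows that the output can be shared mutably across threads without `unsafe` as long as each thread owns a disjoint slice.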
IMHO, the problem is on the put side, i.e. sharing a mutable `ImageBuffer`:
- either we stop returning `ImageBuffer` and return iterators
  - pros:
    - we could probably avoid a lot of `unsafe_put_pixel`
    - perhaps (?) we could chain several fns in an efficient way
  - cons:
    - much more complicated
    - we don't reserve memory upfront
- or we add a `chunk_mut`-like method on `ImageBuffer` itself
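The second option could look something like the sketch below. The `Image` struct is a stand-in, not imageproc's actual `ImageBuffer`, and `row_bands_mut` is a hypothetical name; the point is that `slice::chunks_mut` already hands out disjoint `&mut` windows over the backing buffer, so no `unsafe` is needed for different threads to write different bands.

```rust
// Sketch of a hypothetical chunk_mut-style API on an image type.
struct Image {
    width: usize,
    height: usize,
    data: Vec<u8>, // one byte per pixel (grayscale) for simplicity
}

impl Image {
    /// Disjoint mutable horizontal bands of `rows_per_chunk` rows each.
    /// The borrow checker guarantees the bands never alias.
    fn row_bands_mut(&mut self, rows_per_chunk: usize) -> impl Iterator<Item = &mut [u8]> + '_ {
        self.data.chunks_mut(rows_per_chunk * self.width)
    }
}

fn main() {
    let mut img = Image { width: 4, height: 4, data: vec![0; 16] };
    std::thread::scope(|s| {
        for (i, band) in img.row_bands_mut(2).enumerate() {
            // Each scoped thread owns one band; no locks required.
            s.spawn(move || band.fill(i as u8 + 1));
        }
    });
    assert_eq!(&img.data[0..8], &[1; 8]);
    assert_eq!(&img.data[8..16], &[2; 8]);
    let _ = img.height;
}
```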
Yeah, I've thought (very little) about lazy image operations before, but haven't looked into how feasible this is. There are several places where returning iterators could be useful in order to avoid allocating intermediates at all, so the second con could be a positive.
Adding the ability to get mutable references to disjoint sections of an image sounds like the most straightforward approach, but I'm really not qualified to give much useful input on this issue.
What about providing implementations that run on the GPU, i.e. with OpenCL or OpenGL compute shaders?
That would be great to have. Is there anything in the rust ecosystem already which would help with this? All I can see from a quick Google is https://github.com/luqmana/rust-opencl.
Not directly related, but you might also want to have a look at the discussion here: https://github.com/PistonDevelopers/imageproc/issues/3.
The work on vulkan might well help, even if it means going into the unknown.
A totally different, possibly terrible, idea is to write a rust front-end for http://halide-lang.org/.
https://github.com/tomaka/glium is a great OpenGL wrapper for rust (I contribute there a lot). It has support for OpenGL compute shaders, which are kind of an alternative to OpenCL.
Compute shaders are an OpenGL 4.3 feature though, so not all drivers support them; then again, not all drivers support OpenCL either. Either way, GPU acceleration would have to be a strictly optional feature.
I also like the vulkan approach a lot. See https://github.com/tomaka/vulkano for that.
Since shaders in vulkan and opengl are basically the same, this would also have to use compute shaders. The big upside of vulkan is greater control (e.g. over memory).
This sounds like a great feature to have, but it's not something I know anything about. There's already a "performance" branch, which seems like a suitable place for any experiments along these lines.
It sounds like a fun task. I will try to put something together in the next few days and maybe run some basic benchmarks to see if it's worth the effort.
Nice, looking forward to seeing it.
Ok, just to give an update on how far I've come:
Since I'm not willing to run a nightly mesa, I will wait for the 11.3 release before implementing a vulkan-based approach. In the meantime I have tinkered with rayon a bit. It turns out the synchronisation overhead actually makes the pixel iterations slower when using rayon. Here are the benchmark results for a parallel canny versus the current (sequential) implementation on a 1250 × 1250 pixel image:
Sequential:

```
running 1 test
test edges::test::bench_canny ... bench: 578,696,313 ns/iter (+/- 5,821,699)
```

Parallel:

```
running 1 test
test edges::test::bench_canny ... bench: 865,148,872 ns/iter (+/- 12,552,160)
```
I assume this is due to synchronisation overhead: with rayon we have to lock a Mutex (or RwLock) once per pixel, and because the work done per pixel is usually very small, the locking costs more than the parallelism saves.
I will continue doing some research in this area. Splitting the output image into columns and merging them after processing might be a decent approach; another option might be crossbeam's lock-free data structures, though I'm not sure they're applicable.
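The "split into column strips, merge afterwards" idea could be sketched as follows. Each thread fills a private buffer for its strip, so there are no locks at all, and the strips are stitched back together sequentially at the end. All names here are illustrative, not from imageproc, and a production version would reuse a thread pool rather than spawning per strip.

```rust
// Sketch: compute a width x height image in `strips` vertical strips,
// each on its own thread with a private buffer, then merge row by row.
fn process_by_columns(
    width: usize,
    height: usize,
    strips: usize,
    f: impl Fn(usize, usize) -> u8 + Sync,
) -> Vec<u8> {
    let strip_w = (width + strips - 1) / strips;
    // Each thread returns (x offset, its own strip buffer): no shared state.
    let parts: Vec<(usize, Vec<u8>)> = std::thread::scope(|s| {
        let handles: Vec<_> = (0..strips)
            .map(|i| {
                let x0 = (i * strip_w).min(width);
                let x1 = (x0 + strip_w).min(width);
                let f = &f;
                s.spawn(move || {
                    let mut buf = Vec::with_capacity((x1 - x0) * height);
                    for y in 0..height {
                        for x in x0..x1 {
                            buf.push(f(x, y));
                        }
                    }
                    (x0, buf)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // Merge: copy each strip's rows into the final row-major image.
    let mut out = vec![0u8; width * height];
    for (x0, buf) in parts {
        if buf.is_empty() { continue; }
        let sw = buf.len() / height;
        for y in 0..height {
            out[y * width + x0..y * width + x0 + sw]
                .copy_from_slice(&buf[y * sw..(y + 1) * sw]);
        }
    }
    out
}

fn main() {
    let img = process_by_columns(5, 3, 2, |x, y| (x + 10 * y) as u8);
    let expect: Vec<u8> = (0..3)
        .flat_map(|y| (0..5).map(move |x| (x + 10 * y) as u8))
        .collect();
    assert_eq!(img, expect);
}
```

The merge step costs one extra copy of the image, so this only wins when the per-pixel work is expensive enough to dwarf it; for cheap kernels, handing out disjoint row chunks of the final buffer avoids the copy entirely.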
Thanks for the update. Yeah, synchronising per pixel sounds pretty painful. If you get something working with chunks (or columns) we can hopefully nick the idea for a lot of other functions too.
It turns out that most of the time in the canny benchmark is spent in image::imageops::blur. I've been meaning to implement efficient approximate Gaussian blur via iterated box filters for a while and haven't got round to it. However, you can get a big speed up over the imageops version just by calculating a kernel of f32s and doing the separable filter directly. I'll commit this when I get home today.
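The separable-filter idea mentioned above could look roughly like this: build a normalised 1D Gaussian kernel of `f32`s, run it along rows, then along columns, turning an O(k²) 2D convolution per pixel into two O(k) passes. This is an illustrative sketch, not imageproc's actual implementation; borders are simply clamped, which a real version would want to make configurable.

```rust
// Build a normalised 1D Gaussian kernel with radius ~3 sigma.
fn gaussian_kernel_1d(sigma: f32) -> Vec<f32> {
    let radius = (3.0 * sigma).ceil() as i32;
    let mut k: Vec<f32> = (-radius..=radius)
        .map(|i| (-((i * i) as f32) / (2.0 * sigma * sigma)).exp())
        .collect();
    let sum: f32 = k.iter().sum();
    k.iter_mut().for_each(|v| *v /= sum); // weights sum to 1
    k
}

// Separable blur: horizontal pass into `tmp`, vertical pass into `out`.
fn blur_separable(src: &[f32], width: usize, height: usize, sigma: f32) -> Vec<f32> {
    let k = gaussian_kernel_1d(sigma);
    let r = (k.len() / 2) as i64;
    let clamp = |v: i64, hi: usize| v.clamp(0, hi as i64 - 1) as usize;
    let mut tmp = vec![0.0f32; src.len()];
    for y in 0..height {
        for x in 0..width {
            tmp[y * width + x] = k.iter().enumerate()
                .map(|(i, w)| w * src[y * width + clamp(x as i64 + i as i64 - r, width)])
                .sum();
        }
    }
    let mut out = vec![0.0f32; src.len()];
    for y in 0..height {
        for x in 0..width {
            out[y * width + x] = k.iter().enumerate()
                .map(|(i, w)| w * tmp[clamp(y as i64 + i as i64 - r, height) * width + x])
                .sum();
        }
    }
    out
}

fn main() {
    let k = gaussian_kernel_1d(2.0);
    assert!((k.iter().sum::<f32>() - 1.0).abs() < 1e-4);
    // A constant image must stay constant under a normalised blur.
    let out = blur_separable(&[1.0; 9], 3, 3, 1.0);
    assert!(out.iter().all(|&v| (v - 1.0).abs() < 1e-4));
}
```

The iterated-box-filter approximation mentioned above goes one step further: three or so box passes approximate a Gaussian in O(1) per pixel regardless of sigma.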
Hi, I was just wondering if you ever got ~~home 😅~~ around to this?