imageproc
How best to add parallelism?
Lots of the functions in this library are embarrassingly parallel. How should we parallelise them? Ideally with as little impact on the function bodies as possible. Image-chunk-iterators plus https://github.com/nikomatsakis/rayon?
On the get side, for most functions we could simply change `for y in 0..height` into a rayon equivalent.
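As a rough sketch of that idea (not from the thread): the example below parallelises a per-row loop using only the standard library's `std::thread::scope`, so it runs without any dependencies; with rayon, the same closure body would hang off `(0..height).into_par_iter()` instead of scoped spawns. The `invert_rows` function and the flat `Vec<u8>` "image" are made up for illustration, not part of imageproc's API.

```rust
// Sketch: parallelising a read-only "for y in 0..height" loop over rows.
// Each thread gets a disjoint &mut row of the output via chunks_mut, so
// no locking is needed. With rayon this would be par_chunks_mut instead.
fn invert_rows(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    let mut out = vec![0u8; width * height];
    std::thread::scope(|s| {
        for (y, row) in out.chunks_mut(width).enumerate() {
            let src_row = &src[y * width..(y + 1) * width];
            s.spawn(move || {
                for (dst, &px) in row.iter_mut().zip(src_row) {
                    *dst = 255 - px; // per-pixel work goes here
                }
            });
        }
    });
    let _ = height; // height is implied by src.len() / width here
    out
}

fn main() {
    let src = vec![10u8, 20, 30, 40, 50, 60];
    let out = invert_rows(&src, 3, 2);
    assert_eq!(out, vec![245, 235, 225, 215, 205, 195]);
}
```

Spawning one OS thread per row is obviously not what you'd ship (a real version would hand fixed-size chunks to a thread pool), but it shows that the output can be shared mutably across threads without `unsafe` as long as each thread owns a disjoint slice.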
IMHO, the problem is on the put side, i.e. sharing a mutable `ImageBuffer`:
- either we stop returning `ImageBuffer` and return iterators
  - pros:
    - we could probably avoid a lot of `unsafe_put_pixel`
    - perhaps (?) we could chain several fns in an efficient way
  - cons:
    - much more complicated
    - we don't reserve memory upfront
- or we add a `chunk_mut`-like method on `ImageBuffer` itself
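The second option could look something like the sketch below. The `Image` struct is a stand-in, not imageproc's actual `ImageBuffer`, and `row_bands_mut` is a hypothetical name; the point is that `slice::chunks_mut` already hands out disjoint `&mut` windows over the backing buffer, so no `unsafe` is needed for different threads to write different bands.

```rust
// Sketch of a hypothetical chunk_mut-style API on an image type.
struct Image {
    width: usize,
    height: usize,
    data: Vec<u8>, // one byte per pixel (grayscale) for simplicity
}

impl Image {
    /// Disjoint mutable horizontal bands of `rows_per_chunk` rows each.
    /// The borrow checker guarantees the bands never alias.
    fn row_bands_mut(&mut self, rows_per_chunk: usize) -> impl Iterator<Item = &mut [u8]> + '_ {
        self.data.chunks_mut(rows_per_chunk * self.width)
    }
}

fn main() {
    let mut img = Image { width: 4, height: 4, data: vec![0; 16] };
    std::thread::scope(|s| {
        for (i, band) in img.row_bands_mut(2).enumerate() {
            // Each scoped thread owns one band; no locks required.
            s.spawn(move || band.fill(i as u8 + 1));
        }
    });
    assert_eq!(&img.data[0..8], &[1; 8]);
    assert_eq!(&img.data[8..16], &[2; 8]);
    let _ = img.height;
}
```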
Yeah, I've thought (very little) about lazy image operations before, but haven't looked into how feasible this is. There are several places where returning iterators could be useful in order to avoid allocating intermediates at all, so the second con could be a positive.
Adding the ability to get mutable references to disjoint sections of an image sounds like the most straightforward approach, but I'm really not qualified to give much useful input on this issue.
What about providing implementations that run on the GPU, i.e. with OpenCL or OpenGL compute shaders?
That would be great to have. Is there anything in the rust ecosystem already which would help with this? All I can see from a quick Google is https://github.com/luqmana/rust-opencl.
Not directly related, but you might also want to have a look at the discussion here: https://github.com/PistonDevelopers/imageproc/issues/3.
The work on vulkan might well help, even if it means going into the unknown.
A totally different, possibly terrible, idea is to write a rust front-end for http://halide-lang.org/.
https://github.com/tomaka/glium is a great OpenGL wrapper for rust (I contribute there a lot). It has support for OpenGL compute shaders, which are kind of an alternative to OpenCL.
Compute shaders are an OpenGL 4.3 feature though, so not all drivers support them; then again, not all drivers support OpenCL either. Either way, GPU acceleration would have to be a strictly optional feature.
I also like the vulkan approach a lot. See https://github.com/tomaka/vulkano for that.
Since shaders in vulkan and opengl are basically the same, this would also have to use compute shaders. The big upside of vulkan is greater control (e.g. over memory).
This sounds like a great feature to have, but it's not something I know anything about. There's already a "performance" branch, which seems like a suitable place for any experiments along these lines.
It sounds like a fun task. I will try to put something together in the next few days and maybe run some basic benchmarks to see if it's worth the effort.
Nice, looking forward to seeing it.
Ok, just to give an update on how far I've come:
Since I'm not willing to run a nightly mesa, I will wait for the 11.3 release before implementing a vulkan-based approach. In the meantime I have tinkered with rayon a bit. It turns out the synchronisation overhead actually makes the pixel iterations slower when using rayon. Here are the benchmark results for a parallel canny versus the current (sequential) implementation on a 1250 × 1250 pixel image:
Sequential:

```
running 1 test
test edges::test::bench_canny ... bench: 578,696,313 ns/iter (+/- 5,821,699)
```

Parallel:

```
running 1 test
test edges::test::bench_canny ... bench: 865,148,872 ns/iter (+/- 12,552,160)
```
I assume this is due to synchronisation overhead: with rayon we have to lock a Mutex (or RwLock) once per pixel, and because the work done per pixel is usually very small, the locking costs more than the parallelism saves.
I will continue doing some research in this area. Splitting the output image into columns and merging them after processing might be a decent approach; another option might be crossbeam's lock-free data structures, though I'm not sure they're applicable.
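The "split into column strips, merge afterwards" idea could be sketched as follows. Each thread fills a private buffer for its strip, so there are no locks at all, and the strips are stitched back together sequentially at the end. All names here are illustrative, not from imageproc, and a production version would reuse a thread pool rather than spawning per strip.

```rust
// Sketch: compute a width x height image in `strips` vertical strips,
// each on its own thread with a private buffer, then merge row by row.
fn process_by_columns(
    width: usize,
    height: usize,
    strips: usize,
    f: impl Fn(usize, usize) -> u8 + Sync,
) -> Vec<u8> {
    let strip_w = (width + strips - 1) / strips;
    // Each thread returns (x offset, its own strip buffer): no shared state.
    let parts: Vec<(usize, Vec<u8>)> = std::thread::scope(|s| {
        let handles: Vec<_> = (0..strips)
            .map(|i| {
                let x0 = (i * strip_w).min(width);
                let x1 = (x0 + strip_w).min(width);
                let f = &f;
                s.spawn(move || {
                    let mut buf = Vec::with_capacity((x1 - x0) * height);
                    for y in 0..height {
                        for x in x0..x1 {
                            buf.push(f(x, y));
                        }
                    }
                    (x0, buf)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // Merge: copy each strip's rows into the final row-major image.
    let mut out = vec![0u8; width * height];
    for (x0, buf) in parts {
        if buf.is_empty() { continue; }
        let sw = buf.len() / height;
        for y in 0..height {
            out[y * width + x0..y * width + x0 + sw]
                .copy_from_slice(&buf[y * sw..(y + 1) * sw]);
        }
    }
    out
}

fn main() {
    let img = process_by_columns(5, 3, 2, |x, y| (x + 10 * y) as u8);
    let expect: Vec<u8> = (0..3)
        .flat_map(|y| (0..5).map(move |x| (x + 10 * y) as u8))
        .collect();
    assert_eq!(img, expect);
}
```

The merge step costs one extra copy of the image, so this only wins when the per-pixel work is expensive enough to dwarf it; for cheap kernels, handing out disjoint row chunks of the final buffer avoids the copy entirely.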
Thanks for the update. Yeah, synchronising per pixel sounds pretty painful. If you get something working with chunks (or columns) we can hopefully nick the idea for a lot of other functions too.
It turns out that most of the time in the canny benchmark is spent in image::imageops::blur. I've been meaning to implement efficient approximate Gaussian blur via iterated box filters for a while and haven't got round to it. However, you can get a big speed up over the imageops version just by calculating a kernel of f32s and doing the separable filter directly. I'll commit this when I get home today.
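The separable-filter idea mentioned above could look roughly like this: build a normalised 1D Gaussian kernel of `f32`s, run it along rows, then along columns, turning an O(k²) 2D convolution per pixel into two O(k) passes. This is an illustrative sketch, not imageproc's actual implementation; borders are simply clamped, which a real version would want to make configurable.

```rust
// Build a normalised 1D Gaussian kernel with radius ~3 sigma.
fn gaussian_kernel_1d(sigma: f32) -> Vec<f32> {
    let radius = (3.0 * sigma).ceil() as i32;
    let mut k: Vec<f32> = (-radius..=radius)
        .map(|i| (-((i * i) as f32) / (2.0 * sigma * sigma)).exp())
        .collect();
    let sum: f32 = k.iter().sum();
    k.iter_mut().for_each(|v| *v /= sum); // weights sum to 1
    k
}

// Separable blur: horizontal pass into `tmp`, vertical pass into `out`.
fn blur_separable(src: &[f32], width: usize, height: usize, sigma: f32) -> Vec<f32> {
    let k = gaussian_kernel_1d(sigma);
    let r = (k.len() / 2) as i64;
    let clamp = |v: i64, hi: usize| v.clamp(0, hi as i64 - 1) as usize;
    let mut tmp = vec![0.0f32; src.len()];
    for y in 0..height {
        for x in 0..width {
            tmp[y * width + x] = k.iter().enumerate()
                .map(|(i, w)| w * src[y * width + clamp(x as i64 + i as i64 - r, width)])
                .sum();
        }
    }
    let mut out = vec![0.0f32; src.len()];
    for y in 0..height {
        for x in 0..width {
            out[y * width + x] = k.iter().enumerate()
                .map(|(i, w)| w * tmp[clamp(y as i64 + i as i64 - r, height) * width + x])
                .sum();
        }
    }
    out
}

fn main() {
    let k = gaussian_kernel_1d(2.0);
    assert!((k.iter().sum::<f32>() - 1.0).abs() < 1e-4);
    // A constant image must stay constant under a normalised blur.
    let out = blur_separable(&[1.0; 9], 3, 3, 1.0);
    assert!(out.iter().all(|&v| (v - 1.0).abs() < 1e-4));
}
```

The iterated-box-filter approximation mentioned above goes one step further: three or so box passes approximate a Gaussian in O(1) per pixel regardless of sigma.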
Hi, I was just wondering if you ever got ~~home 😅~~ around to this?