imageproc
Performance
It's currently pretty poor.
Benchmarking:
- Make sure we have a sensible suite of benchmarks for all functions.
- Write a tool to profile vs a selection of standard libraries (vlfeat, opencv, etc.).
Profiling and improving existing functions:
- Do some!
- Understand the cost of Rust operations and what LLVM can optimise away. As well as being syntactically horrific, something like `(*out.get_pixel_mut(x, y))[0] += out.get_pixel(x, y - 1)[0];` looks expensive. In theory this should all be optimised away, but maybe it isn't.
- Bounds checking, safe vs unsafe. If we're doing anything other than iterating through all pixels in order then presumably eliding bounds checks is a lot trickier. Do we want to use unsafe everywhere? That seems a bit... unsafe.
- Is there any benefit in exposing raw scanlines via a DirectImage trait (as opposed to GenericImage, which only allows access to single pixels)?
- Do we need any more abstractions for iterating over pixels/regions to let us get performance benefits safely?
- Actually do the work to make all the existing functions reasonably performant.
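To make the bounds-checking question above concrete, here's a toy sketch. A plain Vec<u8> stands in for our real image types, so everything in it is illustrative. It compares indexed access, where each lookup is bounds checked unless LLVM can prove the index in range, with iterator access, which avoids the checks entirely:

```rust
// Toy grayscale buffer standing in for an ImageBuffer; the real types
// live in the `image` crate, this is just an illustrative sketch.
struct Gray {
    width: usize,
    height: usize,
    data: Vec<u8>,
}

impl Gray {
    // Indexed access: each `self.data[i]` is bounds checked unless LLVM
    // can prove the index is always in range.
    fn invert_indexed(&mut self) {
        for y in 0..self.height {
            for x in 0..self.width {
                let i = y * self.width + x;
                self.data[i] = 255 - self.data[i];
            }
        }
    }

    // Iterator access: no index arithmetic, so no bounds checks at all.
    fn invert_iter(&mut self) {
        for p in self.data.iter_mut() {
            *p = 255 - *p;
        }
    }
}
```

Profiling would need to tell us whether the optimiser actually manages to hoist the checks out of the indexed version in practice.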
Idioms/best practices:
- Write up what we discovered from the profiling. Clearly document which operations are expensive, and how to avoid common pitfalls.
@tafia has been making a lot of large performance improvements recently. It might eventually be nice to write a short performance tutorial/discussion working roughly through the commit history of e.g. one of the filtering functions.
Always working with GenericImage may not be the best strategy for performance. Knowing that we work with a VecBuffer instead (as stated in the readme's non-goals) would allow us:
- to use iterators, so no need for unsafe in many places
- to avoid copying a GenericImage into an ImageBuffer (like here)
- eventually, to implement more iterators instead of new VecBuffers
@theotherphil Is this something I can try working on, or do you not want to go in that direction?
I'm conflicted on this. My initial not-really-a-plan was just to wait until specialisation landed in the compiler and have faster versions of functions specialised to images that we know consist of contiguous blocks of memory (i.e. that implemented some ReallyAChunkOfMemory trait).
I'm somewhat reluctant to just switch to requiring ImageBuffers everywhere because this precludes any possible clever future ideas involving lazy images. It also potentially stops anyone using this library with their own image types without first copying all their data. However, maybe it's reasonable to say that ImageBuffers are the only way to efficiently use the library and if you don't care that much about performance you can just implement to/from buffer for your own image types.
I guess we can always change our minds later - we need to be very explicit that we're not guaranteeing stability of any APIs for the medium term future, so that we can switch around our requirements as we find out what does or doesn't work.
For maximum performance we'd probably also want to control image strides and alignment, which I don't think ImageBuffer currently allows.
Ok, there wasn't anything very concrete there, just rambling about vague concerns.
So, more concretely:
- I'm not convinced that standardising on ImageBuffer is a great idea. The best structure for image encoding/decoding might not be the best for image processing, and I want to keep the ability to vary our requirements somewhat independently of the concrete types in the image library.
- However, allowing for arbitrary GenericImages does impose a huge performance restriction, as you say. I'd be happy for us to start adding extra restrictions on functions for increased performance. The best approach might be to create our own image traits with additional requirements, e.g. ability to provide a row slice, or providing raw access to underlying data, etc.
Does this sound reasonable? This is more work than just switching to ImageBuffer everywhere, and it's not me offering to do the work, so it seems a little unfair to just say "no, do something harder instead".
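To sketch what one of those traits might look like (all names here are hypothetical, not existing imageproc APIs): a trait that hands out each row as a contiguous slice lets functions run their inner loops over slices, with no per-pixel bounds checks:

```rust
// Hypothetical trait: any image type that can expose a row of pixels
// as a contiguous slice. Names are illustrative, not imageproc APIs.
trait RowAccess {
    type Pixel;
    fn height(&self) -> usize;
    fn row(&self, y: usize) -> &[Self::Pixel];
}

// A minimal buffer-backed grayscale image implementing the trait.
struct GrayBuffer {
    width: usize,
    height: usize,
    data: Vec<u8>,
}

impl RowAccess for GrayBuffer {
    type Pixel = u8;
    fn height(&self) -> usize {
        self.height
    }
    fn row(&self, y: usize) -> &[u8] {
        &self.data[y * self.width..(y + 1) * self.width]
    }
}

// A function written against the trait: it sums each row via a slice,
// so the inner loop is plain slice iteration.
fn row_sums<I: RowAccess<Pixel = u8>>(image: &I) -> Vec<u32> {
    (0..image.height())
        .map(|y| image.row(y).iter().map(|&p| p as u32).sum())
        .collect()
}
```

Any GenericImage that is really a chunk of contiguous memory could implement something like this cheaply, while lazy image types would simply not implement it.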
> The best approach might be to create our own image traits with additional requirements, e.g. ability to provide a row slice, or providing raw access to underlying data, etc.
This shouldn't be too complicated, I like it.
Ok, great.
Anything that made functions faster on ImageBuffers would be directly useful for my hypothetical new world of custom traits, and getting the API for these traits right will involve having some experience of trying to write optimised functions for buffer-like images. So it's probably worth maintaining an ImageBuffer branch for a while to let us experiment with optimisations that are enabled by restricting everything to ImageBuffers and get a better idea of what our new traits should look like.
I've just created a "performance" branch for this.
For future reference, here are all the current benchmarks according to my laptop (ignore the "baseline" ones - they're there to let us make sure that whatever we add is faster than what's already in image::imageops).
test affine::test::bench_rotate_bilinear ... bench: 402,929 ns/iter (+/- 12,354)
test affine::test::bench_rotate_nearest ... bench: 319,127 ns/iter (+/- 52,885)
test affine::test::bench_translate ... bench: 211,600 ns/iter (+/- 12,841)
test contrast::test::bench_equalize_histogram ... bench: 2,236,089 ns/iter (+/- 224,182)
test corners::test::bench_is_corner_fast12_12_noncontiguous ... bench: 21 ns/iter (+/- 2)
test corners::test::bench_is_corner_fast9_9_contiguous_lighter_pixels ... bench: 21 ns/iter (+/- 2)
test filter::test::bench_baseline_gaussian_stdev_1 ... bench: 1,720,996 ns/iter (+/- 147,949)
test filter::test::bench_baseline_gaussian_stdev_10 ... bench: 12,492,048 ns/iter (+/- 518,286)
test filter::test::bench_baseline_gaussian_stdev_3 ... bench: 4,335,076 ns/iter (+/- 317,193)
test filter::test::bench_box_filter ... bench: 2,813,485 ns/iter (+/- 156,651)
test filter::test::bench_filter3x3_i32_filter ... bench: 1,988,750 ns/iter (+/- 87,241)
test filter::test::bench_horizontal_filter ... bench: 3,051,673 ns/iter (+/- 227,131)
test filter::test::bench_separable_filter ... bench: 2,251,522 ns/iter (+/- 538,489)
test filter::test::bench_vertical_filter ... bench: 3,172,381 ns/iter (+/- 142,535)
test haar::test::bench_evaluate_all_filters_10x10 ... bench: 1,919,664 ns/iter (+/- 456,883)
test integralimage::test::bench_column_running_sum ... bench: 778 ns/iter (+/- 123)
test integralimage::test::bench_integral_image ... bench: 443,570 ns/iter (+/- 63,085)
test integralimage::test::bench_row_running_sum ... bench: 596 ns/iter (+/- 51)
test regionlabelling::test::bench_connected_components_eight_chessboard ... bench: 923,307 ns/iter (+/- 112,500)
test regionlabelling::test::bench_connected_components_four_chessboard ... bench: 472,306 ns/iter (+/- 147,901)
test suppress::test::bench_local_maxima_dense ... bench: 27,985 ns/iter (+/- 5,060)
test suppress::test::bench_local_maxima_sparse ... bench: 36,700 ns/iter (+/- 13,368)
test suppress::test::bench_suppress_non_maximum_decreasing_gradient ... bench: 1,297 ns/iter (+/- 642)
test suppress::test::bench_suppress_non_maximum_increasing_gradient ... bench: 1,759 ns/iter (+/- 329)
test suppress::test::bench_suppress_non_maximum_noise_1 ... bench: 5,255 ns/iter (+/- 1,382)
test suppress::test::bench_suppress_non_maximum_noise_3 ... bench: 3,218 ns/iter (+/- 2,241)
test suppress::test::bench_suppress_non_maximum_noise_7 ... bench: 2,284 ns/iter (+/- 311)
Just read about http://bluss.github.io/rust-ndarray/master/ndarray/index.html on reddit. It might help when working with kernels etc.
It looks good. I've not had the time to work on this library much recently but I've just started reading up on halide a little and think we could nick some of their ideas. I doubt we'll ever be as fast as them, but it would be interesting to see how far we could get towards allowing configurable pipelines of functions by combining eager and static evaluation of iterators of image chunks. This is obviously ridiculously vague - I've not given serious thought to how exactly we'd do this, but I think we definitely need to move away from functions that force the creation of intermediates at all steps. The n-d split_at support in ndarray looks really promising for this.
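As a tiny illustration of the "avoid forcing intermediates" point (a toy on u8 slices, nothing like a real Halide-style pipeline): composing iterator adaptors fuses the steps into a single pass, while the eager version allocates a full intermediate buffer between steps:

```rust
// Eager style: each step materialises a full intermediate buffer,
// which is what our current function-per-step API forces.
fn brighten_then_threshold_eager(input: &[u8]) -> Vec<u8> {
    let brightened: Vec<u8> = input.iter().map(|&p| p.saturating_add(50)).collect();
    brightened
        .iter()
        .map(|&p| if p > 128 { 255 } else { 0 })
        .collect()
}

// Fused style: the adaptors are composed lazily and evaluated in a
// single pass by the final `collect`, with no intermediate allocation.
fn brighten_then_threshold_fused(input: &[u8]) -> Vec<u8> {
    input
        .iter()
        .map(|&p| p.saturating_add(50))
        .map(|p| if p > 128 { 255 } else { 0 })
        .collect()
}
```

Doing the same for stencil operations like filters is much harder, since each output pixel reads a neighbourhood of inputs - which is where something like ndarray's n-d split_at might help.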
https://github.com/libmir/dcv This D library is based heavily on the ndslice type, so might provide some useful ideas.
I have a question: is this crate based on CPU computations? And if yes, would it be interesting to add CUDA and OpenCL support?
Yes, everything is on the CPU at the moment. It certainly would be interesting to add GPU support - do you have any ideas for how we might go about doing this?
Yes, actually!
There are a few crates out there that offer OpenCL and CUDA support. We'd have to research them. However, the core of the idea is PIMPL.
Basically we define a bunch of traits, instead of structs. Example:
```rust
trait VerticalSobel {
    fn vertical_sobel(&self, image: &GrayImage) -> Image<Luma<i16>>;
}
```
Then we'll have the different implementors:
```rust
struct CPUImpl { /* ... */ }
struct CUDAImpl { /* ... */ }
```
Which implement the traits:
```rust
impl VerticalSobel for CPUImpl {
    fn vertical_sobel(&self, image: &GrayImage) -> Image<Luma<i16>> {
        // Use the CPU to perform the calculations
    }
}

impl VerticalSobel for CUDAImpl {
    fn vertical_sobel(&self, image: &GrayImage) -> Image<Luma<i16>> {
        // Use CUDA to perform the calculations
    }
}
```
And to use them, we'd have something like
```rust
struct Gradients<T> {
    implementation: T,
    // ...
}

impl<T: VerticalSobel> VerticalSobel for Gradients<T> {
    fn vertical_sobel(&self, image: &GrayImage) -> Image<Luma<i16>> {
        self.implementation.vertical_sobel(image)
    }
}
```
That way people would have a stable API to use in their code, and then we could setup at compile time to use CUDA/OpenCL if it's available, or fall back to the CPU implementation if not.
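A rough sketch of what that compile-time setup could look like, using a Cargo feature flag to choose the default backend. The "cuda" feature name and everything else here is hypothetical, and a u8 slice stands in for GrayImage so the example is self-contained:

```rust
// Hypothetical trait and backends, mirroring the sketch above; a u8
// slice stands in for GrayImage so the example is self-contained.
trait VerticalSobel {
    fn vertical_sobel(&self, image: &[u8]) -> Vec<i16>;
}

struct CPUImpl;

impl VerticalSobel for CPUImpl {
    fn vertical_sobel(&self, image: &[u8]) -> Vec<i16> {
        // Placeholder: a real implementation would convolve the image
        // with the vertical Sobel kernel on the CPU.
        image.iter().map(|&p| p as i16).collect()
    }
}

#[cfg(feature = "cuda")]
struct CUDAImpl;

#[cfg(feature = "cuda")]
impl VerticalSobel for CUDAImpl {
    fn vertical_sobel(&self, _image: &[u8]) -> Vec<i16> {
        unimplemented!("dispatch to a CUDA kernel here")
    }
}

// The backend is chosen once, at compile time, by the feature flag;
// without the "cuda" feature only the CPU path is even compiled.
#[cfg(feature = "cuda")]
fn default_backend() -> impl VerticalSobel {
    CUDAImpl
}

#[cfg(not(feature = "cuda"))]
fn default_backend() -> impl VerticalSobel {
    CPUImpl
}
```

Callers would only ever see the trait, so swapping backends wouldn't change user code.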
I'm afraid I don't follow your examples. Could you elaborate on why there's an implementation of VerticalSobel for Gradients here, and on what this would look like for the eventual user?
Gradients doesn't need to have an implementation of VerticalSobel. It's just a possibility. Actually, my example was kind of badly thought out. I think it would be better to have a Processor struct.
```rust
struct Processor<T> {
    image: GrayImage,
    implementation: T,
    // ...
}

impl<T: VerticalSobel> Processor<T> {
    fn vertical_sobel(&self) -> Image<Luma<i16>> {
        self.implementation.vertical_sobel(&self.image)
    }
}
```
As for the end user, it could either look like this:
```rust
let img: GrayImage = get_gray_image();
let processor = Processor::new_with_CUDA();
let vertical_sobel = processor.vertical_sobel(&img);
```
Or this:
```rust
let img: GrayImage = get_gray_image();
let processor = Processor::new_with_CPU(img);
let vertical_sobel = processor.vertical_sobel();
```
That way all image processing would be handled via this Processor struct, which has a user-chosen implementation of the algorithms.
This sounds like an interesting approach. Before making sweeping API changes we should have a few GPU implementations in place to experiment with (initially just defined as separate functions). However, I'm certainly open to switching to something like this if it works out.
Googling around for GPU code with Rust, there seem to be a few crates already:
https://github.com/arrayfire/arrayfire-rust https://github.com/autumnai/rust-cudnn https://github.com/cogciprocate/ocl
We just need to pick a good one to test out. That arrayfire one looks promising.
I agree! Are you planning on trying this out yourself?
I'd love to, but I can't promise anything. College is a huge time drain, unfortunately. But hey, the idea is there, so maybe someone else can get interested in it?
Ok, thanks for the suggestions!