Add OpenMP threading to search
Updated benchmark:
---------------------------------------------------------------------------------------------
Benchmark                                        Time         CPU     Iterations
---------------------------------------------------------------------------------------------
OrganizedNeighborSearch                       26.0 us      26.0 us         35446
KdTree                                        52.1 us      52.2 us         10000
KdTreeAll/iterations:1/manual_time             282 us  58906299 us             1 items_per_second=743.246M/s
KdTreeAllThreaded/iterations:1/manual_time    74.4 us   2626656 us             1 items_per_second=2.81404G/s
KdTreeNanoflann                               42.6 us      42.5 us         19114
Hi, just a quick question: setting the thread number to 1 will not spawn OpenMP worker threads? That is, exactly the same behaviour as before, with OpenMP optional?
I am not sure if there is a simple yes/no answer. Keeping OpenMP optional in PCL is definitely the goal; you can even opt out completely while building PCL: https://github.com/PointCloudLibrary/pcl/blob/master/CMakeLists.txt#L309 I would assume that OpenMP implementations only spawn new threads if they are needed, but I do not know what the OpenMP standard dictates and what is implementation-defined. If I remember correctly, I have seen that some compilers introduce a kind of "wrapper function" around code parts that are parallelized (even with a single thread), so at the binary level, using OpenMP with a single thread does not seem to be 100% the same as not using OpenMP at all. In the end, it depends: why are you asking?
Just to understand whether, by upgrading PCL, I would end up with OpenMP forcibly baked in or not. I recall that for NormalEstimation, for example, there exists an OMP-suffixed version, and the user can choose. In this PR that does not seem to be the case, so I was just wondering about the reason for the different approach.
But as you say, building PCL without OpenMP support will just "drop" the OpenMP pragmas, right? So there is a way to avoid OpenMP at the build stage, if I'm not mistaken.
Yes, in the recent past we switched to adding OpenMP to existing classes instead of having two classes doing the same thing, one using OpenMP and one not (too much code duplication and binary size bloat in that approach). There have even been some considerations about merging e.g. NormalEstimation and NormalEstimationOMP into one class (which then makes use of OpenMP).
Exactly; by passing -DWITH_OPENMP=OFF to CMake, it will not search for or link to OpenMP, so the compiler will not "know" the OpenMP pragmas and will ignore them.
Thank you very much for the clarifications :)
@jmackay2 Out of interest, in the benchmark above, how many threads did you use (what does omp_get_num_procs() return on your computer)? In your results, KdTreeAllThreaded is about 3.8 times faster than KdTreeAll.
Yeah, good question. This benchmark is on 12 threads on my machine.
In the past, when parallelizing loops that call nearestKSearch or radiusSearch, we have often observed that the work is not distributed evenly between the threads (when using the default static schedule) because the loop iterations need different amounts of time. See for example https://github.com/PointCloudLibrary/pcl/pull/6155 . If you like, you could try something similar here, but if not, that's fine as well.
I tested using a dynamic schedule. It did make things a little faster, but not by a lot: about a 10% speed-up from static to dynamic, at least when running the benchmark.