pcl Add OpenMP threading to search

Updated benchmark:

-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
OrganizedNeighborSearch                          26.0 us         26.0 us        35446
KdTree                                           52.1 us         52.2 us        10000
KdTreeAll/iterations:1/manual_time                282 us     58906299 us            1 items_per_second=743.246M/s
KdTreeAllThreaded/iterations:1/manual_time       74.4 us      2626656 us            1 items_per_second=2.81404G/s
KdTreeNanoflann                                  42.6 us         42.5 us        19114

May 30 '25 04:05 jmackay2

Hi, just a quick question, setting thread number = 1 will not spawn OpenMP worker threads? Like, exactly the same behaviour as before, with OpenMP optinal?

May 30 '25 09:05 roncapat

Hi, just a quick question, setting thread number = 1 will not spawn OpenMP worker threads? Like, exactly the same behaviour as before, with OpenMP optinal?

I am not sure if there is a simple yes/no answer. Keeping OpenMP optional in PCL is definitely the goal, you can even opt-out completely while building PCL: https://github.com/PointCloudLibrary/pcl/blob/master/CMakeLists.txt#L309 I would assume that OpenMP implementations only spawn new threads if they are needed, however I do not know what the OpenMP standard dictates and what is implementation defined. If I remember correctly, I have seen that some compilers introduce some kind of "wrapper function" around code parts that are parallellized (even with a single thread), so on the binary level, using OpenMP with a single thread does not seem to be 100% the same as not using OpenMP at all. In the end, it depends on why you are asking?

May 30 '25 09:05 mvieth

In the end, it depends on why you are asking?

Just to understand whether by upgrading PCL I would incur in having OpenMP forcibly baked in or not. I recall for NormalEstimation fo example there exist an OMP-suffixed version, and user can choose. In this PR, this does not seem to be the case so I was just wandering the reason for this different approach.

May 30 '25 10:05 roncapat

But as you say, building PCL without OpenMP support will just "drop" the OpenMP pragmas, right? So there is a solution to avoid OpenMP, at build stage if I'm not mistaken.

May 30 '25 10:05 roncapat

Just to understand whether by upgrading PCL I would incur in having OpenMP forcibly baked in or not. I recall for NormalEstimation fo example there exist an OMP-suffixed version, and user can choose. In this PR, this does not seem to be the case so I was just wandering the reason for this different approach.

Yes, in the recent past we switched to adding OpenMP to existing classes, instead of having two classes doing the same thing, one using OpenMP and one not (too much code duplication and size bloating in this approach). There have even been some considerations about merging e.g. NormalEstimation and NormalEstimationOMP into one class (which then makes use of OpenMP).

But as you say, building PCL without OpenMP support will just "drop" the OpenMP pragmas, right? So there is a solution to avoid OpenMP, at build stage if I'm not mistaken.

Exactly, by passing -DWITH_OPENMP=OFF to CMake, it will not search for or link to OpenMP, thus the compiler will not "know" the OpenMP pragmas and ignore them.

May 30 '25 19:05 mvieth

Thank you very much for clarifications :)

May 30 '25 20:05 roncapat

@jmackay2 Out of interest, in the benchmark above, how many threads did you use (what does omp_get_num_procs() return on your computer)? In your results, KdTreeAllThreaded is about 3.8 times faster than KdTreeAll.

Jun 03 '25 13:06 mvieth

@jmackay2 Out of interest, in the benchmark above, how many threads did you use (what does omp_get_num_procs() return on your computer)? In your results, KdTreeAllThreaded is about 3.8 times faster than KdTreeAll.

Yeah, good question. This benchmark is on 12 threads on my machine.

Jun 04 '25 05:06 jmackay2

Yeah, good question. This benchmark is on 12 threads on my machine.

In the past, when parallelizing loops that call nearestKSearch or radiusSearch, we have often observed that the work is not distributed evenly between the threads (when using the default static schedule) because the loop iterations need different amounts of time. See for example https://github.com/PointCloudLibrary/pcl/pull/6155 . If you like, you could try something similar here, but if not, that's fine as well.

Jun 04 '25 08:06 mvieth

In the past, when parallelizing loops that call nearestKSearch or radiusSearch, we have often observed that the work is not distributed evenly between the threads (when using the default static schedule) because the loop iterations need different amounts of time. See for example #6155 . If you like, you could try something similar here, but if not, that's fine as well.

I tested out using a dynamic schedule. It did make things a little faster, but not by a lot. I saw a 10% speed-up from static to dynamic, as least when running the benchmark.

Jun 07 '25 03:06 jmackay2