[registration] Add OMP based Multi-threading to SampleConsensusPrerejective
This PR adds OMP-based multi-threading capability to SampleConsensusPrerejective.
Changes are as follows:
- const qualifiers are added to several methods to clarify that they are thread-safe
- getFitness() is modified so that it uses a per-thread transformation variable (see the sketch below)
- similar_features are precomputed for concurrency
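For illustration, here is a minimal sketch of the threading pattern (hypothetical names and structure, not the exact PR code): the shared similar_features table is filled once up front and only read inside the parallel loop, while each thread keeps its own candidate transformation.

```cpp
#include <Eigen/Geometry>
#include <limits>
#include <vector>

// Hypothetical sketch of the parallel RANSAC main loop: similar_features is
// precomputed and read-only, each thread holds a private candidate
// transformation, and only the best-result update is serialized.
void
computeTransformationSketch (int max_iterations, int num_threads,
                             const std::vector<std::vector<int>>& similar_features)
{
  float best_error = std::numeric_limits<float>::max ();
  Eigen::Matrix4f best_transformation = Eigen::Matrix4f::Identity ();

#pragma omp parallel num_threads(num_threads)
  {
    // Per-thread state: threads never write to a shared candidate
    Eigen::Matrix4f transformation = Eigen::Matrix4f::Identity ();
    float error = std::numeric_limits<float>::max ();

#pragma omp for
    for (int i = 0; i < max_iterations; ++i)
    {
      // ... draw samples, look up correspondences in similar_features,
      //     estimate `transformation`, evaluate `error` ...

#pragma omp critical
      {
        if (error < best_error)
        {
          best_error = error;
          best_transformation = transformation;
        }
      }
    }
  }
}
```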
@koide3 Could you please tag the PR appropriately?
PS: Failures on Mac
@kunaltyagi It seems I don't have permission to edit tags... I'll take a look at the errors on Mac later.
I did rebase and squash to resolve the conflict.
@SunBlack could you please take a look at the OMP pragma sites?
It took a while until both compilers built the branch.
@kunaltyagi After removing Ubuntu 16.04 from Azure: do we still have GCC 6-8 anywhere, so we can still be sure to have OPENMP_LEGACY_CONST_DATA_SHARING_RULE where necessary?
@koide3 Are you sure about the usage of firstprivate? A copy of the vectors is made for each thread.
do we still have GCC 6-8 anywhere?
Yes, 18.04 has GCC 7 on it (though we force install GCC 8 on it for now due to ... reasons)
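For context, a minimal sketch of the rule that macro guards (assuming OPENMP_LEGACY_CONST_DATA_SHARING_RULE is defined to 1 on GCC <= 8 and to 0 otherwise): older GCC treats a const variable as predetermined shared and rejects it in a shared() clause, while GCC >= 9 requires it to be listed under default(none).

```cpp
const int n = 100;
int sum = 0;

// GCC <= 8: `n` is predetermined shared and must not appear in shared().
// GCC >= 9: under default(none), `n` must be listed explicitly.
#if OPENMP_LEGACY_CONST_DATA_SHARING_RULE
#pragma omp parallel for default(none) reduction(+:sum)
#else
#pragma omp parallel for default(none) shared(n) reduction(+:sum)
#endif
for (int i = 0; i < n; ++i)
  sum += i;
```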
@SunBlack
Are you sure about the usage of firstprivate? A copy of the vectors is made for each thread.
Yes, one copy of them must be created for each thread.
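A small self-contained example of the copy semantics (illustrative only):

```cpp
#include <cstdio>
#include <omp.h>
#include <vector>

int
main ()
{
  std::vector<int> scratch (4, 0);

  // firstprivate: each thread gets its own copy, initialized from the
  // original, so the writes below are race-free.
#pragma omp parallel firstprivate(scratch)
  {
    scratch[0] = omp_get_thread_num ();  // writes the thread-local copy
    std::printf ("thread %d sees %d\n", omp_get_thread_num (), scratch[0]);
  }

  std::printf ("original after region: %d\n", scratch[0]);  // still 0
  return 0;
}
```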
Do you have an estimate of which parts of computeTransformation take the most time? I am asking to determine whether it really makes sense to parallelize both for loops as you did, or whether it is better to parallelize only the first loop because the second loop is fast anyway and the parallelization overhead is too large. Just a thought.
It would also be interesting to see how large the speedup actually is, e.g. if you run the OpenMP version with 2 or 4 threads, how much faster is it than running it with one thread? That could give hints as to whether the OpenMP implementation still has opportunities for optimization.
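Something along these lines would do (a sketch; it assumes the class gained a setNumberOfThreads() setter like the other *OMP classes in PCL):

```cpp
#include <omp.h>

// Hypothetical timing harness: time align() for a given thread count.
// setNumberOfThreads() is assumed here, mirroring other PCL OMP classes.
template <typename Alignment, typename CloudT>
double
timeAlign (Alignment& align, CloudT& output, int threads)
{
  align.setNumberOfThreads (threads);
  const double t0 = omp_get_wtime ();
  align.align (output);          // pcl::Registration::align
  return omp_get_wtime () - t0;  // seconds
}

// Usage idea: for (int t : {1, 2, 4, 8}) report timeAlign (scp, output, t);
```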
Here is a brief benchmark result. The second for loop, which is the main loop of RANSAC, is the most expensive part of this algorithm, while the first for loop is not so heavy. It seems the overhead of OMP is not large in this case, and both loops get much faster with OMP.
15950 pts vs 15772 pts (FPFH)

with OMP

| threads | 1st [msec] | 2nd [msec] | total [msec] |
|--------:|-----------:|-----------:|-------------:|
| 1 | 261.082 | 22999.672 | 23260.755 |
| 2 | 118.717 | 12271.585 | 12390.303 |
| 3 | 79.286 | 8156.884 | 8236.170 |
| 4 | 63.650 | 7475.746 | 7539.397 |
| 5 | 53.186 | 5391.872 | 5445.058 |
| 6 | 44.846 | 4408.987 | 4453.833 |
| 7 | 43.878 | 3903.800 | 3947.678 |
| 8 | 36.810 | 4014.925 | 4051.735 |
| 9 | 33.360 | 3960.413 | 3993.773 |

without OMP

| threads | 1st [msec] | 2nd [msec] | total [msec] |
|--------:|-----------:|-----------:|-------------:|
| 1 | 263.646 | 23792.844 | 24056.490 |
| 1 | 226.914 | 24037.704 | 24264.618 |
| 1 | 226.266 | 22705.109 | 22931.375 |
| 1 | 221.195 | 25475.866 | 25697.061 |
Can I do rebase and force-push to resolve the conflicts?
Conflicts are resolved.
@JStech @mvieth Please resolve the conversations that have reached a conclusion. Otherwise it becomes a bit difficult to see whether there's work left.
Somehow my comment threads don't have "Resolve" buttons, maybe because I made them comments and not change requests in the first place? I do consider them all resolved.
I checked again and the drawing of the random samples (selectSamples) should not be done in parallel (calls to rand()). That has to either be wrapped in omp critical, or maybe a better approach would be to make the for-loop in getFitness parallel and keep the main for-loop sequential. That would be safer, but I haven't checked yet how that would compare in terms of speed-up.
I checked again and the drawing of the random samples (selectSamples) should not be done in parallel (calls to rand()). That has to either be wrapped in omp critical, or maybe a better approach would be to make the for-loop in getFitness parallel and keep the main for-loop sequential. That would be safer, but I haven't checked yet how that would compare in terms of speed-up.
Ahh, yes, that seems to cause problems. At least it's not as random as it should be, and according to some reading, calling rand() in a multi-threaded manner can lead to UB.
I guess estimateRigidTransformation is also somewhat compute-intensive, so the better option is to guard the random sampling?
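Something like this sketch, assuming the existing selectSamples() signature:

```cpp
std::vector<int> sample_indices (nr_samples_);

// rand() is not guaranteed to be thread-safe, so serialize the sample
// drawing; estimation and fitness evaluation stay parallel.
#pragma omp critical
{
  selectSamples (*input_, nr_samples_, sample_indices);
}

// ... estimateRigidTransformation / getFitness run in parallel outside
//     the critical section ...
```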
I guess estimateRigidTransformation is also somewhat compute-intensive, so the better option is to guard the random sampling?
I can't confirm that: In my tests, less than 1% of the time of computeTransformation was spent in estimateRigidTransformation. Almost all of the time was spent in getFitness and finding similar features. Another reason to not parallelize the main loop: currently, the loop runs exactly max_iterations_ times, but it could make sense to stop earlier if a good solution has been found. With a parallel main loop, that would be more difficult to implement.
I will check how well getFitness can be parallelized.
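Roughly what I have in mind (a sketch, not the final code; names like corr_dist_threshold_ are assumed, and it relies on the kd-tree being safe for concurrent read-only queries): the per-point nearest-neighbor test is independent, so a reduction over the inlier count and the error is enough.

```cpp
std::size_t inliers = 0;
float error = 0.0f;
// Squared correspondence-distance threshold (member name assumed)
const float max_dist_sq = corr_dist_threshold_ * corr_dist_threshold_;

#pragma omp parallel for reduction(+:inliers,error)
for (std::ptrdiff_t i = 0;
     i < static_cast<std::ptrdiff_t> (input_transformed.size ()); ++i)
{
  // Per-thread scratch buffers for the 1-NN query
  std::vector<int> nn_index (1);
  std::vector<float> nn_dist_sq (1);
  tree_->nearestKSearch (input_transformed[i], 1, nn_index, nn_dist_sq);

  if (nn_dist_sq[0] < max_dist_sq)
  {
    ++inliers;
    error += nn_dist_sq[0];
  }
}
```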
Yeah okay, that doesn't sound like much. Let's see what your tests show 👍