
[registration] Add OMP based Multi-threading to SampleConsensusPrerejective

Open koide3 opened this issue 5 years ago • 18 comments

This PR adds OMP-based multi-threading capability to SampleConsensusPrerejective.

Changes are as follows:

  • const qualifiers are added to several methods to clarify that they are thread-safe
  • getFitness() is modified to use a per-thread transformation variable
  • similar_features are precomputed for concurrency

koide3 avatar Oct 05 '20 02:10 koide3

@koide3 Could you please tag the PR appropriately?

PS: Failures on Mac

kunaltyagi avatar Oct 05 '20 04:10 kunaltyagi

@kunaltyagi It seems I don't have permission to edit tags... I'll take a look at the errors on Mac later.

koide3 avatar Oct 05 '20 05:10 koide3

I did rebase and squash to resolve the conflict.

koide3 avatar Nov 03 '20 13:11 koide3

@SunBlack could you please take a look at the OMP pragma sites?

kunaltyagi avatar Nov 03 '20 16:11 kunaltyagi

@SunBlack could you please take a look at the OMP pragma sites?

It took a while until both compilers could build the branch.

@kunaltyagi After removing Ubuntu 16.04 from Azure: do we still have GCC 6-8 anywhere, so we can still be sure OPENMP_LEGACY_CONST_DATA_SHARING_RULE is applied where necessary?

@koide3 Are you sure about the usage of firstprivate? A copy of the vectors is made for each thread.

SunBlack avatar Nov 04 '20 16:11 SunBlack

do we still have a GCC 6-8 anywhere?

Yes, 18.04 has GCC 7 on it (though we force install GCC 8 on it for now due to ... reasons)

kunaltyagi avatar Nov 04 '20 21:11 kunaltyagi

@SunBlack

Are you sure about the usage of firstprivate? A copy of the vectors is made for each thread.

Yes, one copy of them must be created for each thread.

koide3 avatar Nov 05 '20 08:11 koide3

Do you have an estimate of which parts of computeTransformation take the most time? I am asking to determine whether it really makes sense to parallelize both for loops as you did, or whether parallelizing only the first loop would be better because the second loop is fast anyway and the parallelization overhead is too large (just a thought). It would also be interesting to see how large the speedup actually is: for example, if you run the OpenMP version with 2 or 4 threads, how much faster is it than with one thread? That could hint at whether the OpenMP implementation still has opportunities for optimization.

mvieth avatar Nov 17 '20 19:11 mvieth

Here is a brief benchmark result. The second for loop, which is the main RANSAC loop, is the most expensive part of the algorithm, while the first for loop is comparatively light. The OpenMP overhead seems small in this case, and both loops get much faster with OpenMP.

15950 pts vs 15772 pts (FPFH)

with OMP
threads   1st [msec]   2nd [msec]    total [msec]
1          261.082     22999.672     23260.755
2          118.717     12271.585     12390.303
3           79.286      8156.884      8236.170
4           63.650      7475.746      7539.397
5           53.186      5391.872      5445.058
6           44.846      4408.987      4453.833
7           43.878      3903.800      3947.678
8           36.810      4014.925      4051.735
9           33.360      3960.413      3993.773

without OMP
threads   1st [msec]   2nd [msec]    total [msec]
1          263.646     23792.844     24056.490
1          226.914     24037.704     24264.618
1          226.266     22705.109     22931.375
1          221.195     25475.866     25697.061

koide3 avatar Nov 18 '20 02:11 koide3

May I rebase and force-push to resolve the conflicts?

koide3 avatar Nov 19 '20 08:11 koide3

Conflicts are resolved.

koide3 avatar Nov 24 '20 06:11 koide3

@JStech @mvieth Please resolve the conversations that have reached a conclusion. It is becoming a bit difficult to see whether there's work left.

kunaltyagi avatar Nov 24 '20 11:11 kunaltyagi

Somehow my comment threads don't have "Resolve" buttons, maybe because I made them comments rather than change requests in the first place? I do consider them all resolved.

JStech avatar Nov 25 '20 15:11 JStech

I checked again, and the drawing of the random samples (selectSamples) should not be done in parallel because it calls rand(). That either has to be wrapped in omp critical, or perhaps a better approach would be to make the for loop in getFitness parallel and keep the main for loop sequential. That would be safer, but I haven't checked yet how it would compare in terms of speed-up.

mvieth avatar Aug 23 '22 13:08 mvieth

I checked again, and the drawing of the random samples (selectSamples) should not be done in parallel because it calls rand(). That either has to be wrapped in omp critical, or perhaps a better approach would be to make the for loop in getFitness parallel and keep the main for loop sequential. That would be safer, but I haven't checked yet how it would compare in terms of speed-up.

Ahh, yes, that seems to cause problems. At least it's not as random as it should be, and according to some reading, calling rand() from multiple threads can lead to UB.

larshg avatar Aug 24 '22 04:08 larshg

I guess estimateRigidTransformation is also somewhat compute-intensive, so the better option is to guard the random sampling?

larshg avatar Aug 24 '22 06:08 larshg

I guess the estimateRigidTransformation is also somewhat compute intensive, so the better option is to guard the random sampling?

I can't confirm that: in my tests, less than 1% of the time of computeTransformation was spent in estimateRigidTransformation. Almost all of the time was spent in getFitness and in finding similar features. Another reason not to parallelize the main loop: currently the loop runs exactly max_iterations_ times, but it could make sense to stop earlier once a good solution has been found, and with a parallel main loop that would be more difficult to implement. I will check how well getFitness can be parallelized.

mvieth avatar Aug 24 '22 18:08 mvieth

Yeah, okay, that doesn't sound like much. Let's see what your tests show 👍

larshg avatar Aug 24 '22 18:08 larshg