pcl
Added parallel implementation of PrincipalCurvaturesEstimation
I was using PrincipalCurvaturesEstimation for a project and noticed that it didn't have an OMP variant like many of the other features do, and the runtime of this algorithm is fairly slow. I took inspiration from the NormalEstimation -> NormalEstimationOMP comparison. I've tested my implementation on a sample point cloud, and verified that all curvatures were exactly equal to those calculated by the single-threaded version, and that the computation ran in parallel as expected.
I do have a couple questions to just double check with any reviewer to make sure I did this right:
- The original PrincipalCurvaturesEstimation stored most of their calculation variables (covariance matrix, projected normals, etc.) as private class members, when they could have easily been local in scope. I changed this to local variables in my implementation to prevent data races. I just want to make sure there wasn't some subtle reason that I'm missing that those variables would be stored as class members.
- For the reason described above, it would also be very easy to modify the existing PrincipalCurvatuesEstimation class to support parallel execution. Would that be preferable to creating a new class as I've done?
- In my implementation I corrected one or two minor things from the original, such as const-correctness. I also switched to using the more expressive `getNormalVector3fMap()` where applicable. I just want to verify that any class with a normal is guaranteed to have that method available.
Hey @alexnavtt
I just noticed this comment in the documentation: "Note: The code is stateful as we do not expect this class to be multicore parallelized. Please look at NormalEstimationOMP for an example on how to extend this to parallel implementations."
I wonder if it returns the same information? I haven't looked deeply into this.
Edit: I guess you made the OMP version non-stateful, and it does provide additional information 👍 I misread the original note.
As for having both PrincipalCurvaturesEstimation and a PrincipalCurvaturesEstimationOMP - I think the latest direction was to move the algorithm into the base class only, with the OMP subclass keeping just the methods for setting the number of threads etc.
So could PrincipalCurvaturesEstimation be made non-stateful as well, so that we keep only a single implementation?
What's your opinion, @mvieth?
I think it could make sense to just parallelize the existing class, and not add another class for the parallel implementation. The only concern I would have is whether changing `computePointPrincipalCurvatures` to use local temporary variables instead of the private class members is slower, because it needs to allocate/construct and free the variables repeatedly. A small benchmark might be interesting. An alternative would be to add an overload of `computePointPrincipalCurvatures` that also receives the temporary variables (or at least some of them, like `projected_normals_`, maybe `covariance_matrix_`), which are then local to each thread.
Alright, this all sounds good. I'll run a quick benchmark today or tomorrow to see if it makes any difference (personally I'm expecting it won't, since statically sized Eigen matrices are allocated on the stack), and then I'll move the edits to the base class.
The only remaining question now is whether the parallelized PrincipalCurvaturesEstimation should have a default thread number of 0 or 1. I'm leaning towards leaving it at 1 to maintain API stability; people can manually change that if they want parallelization.
> Alright, this all sounds good. I'll run a quick benchmark today or tomorrow to see if it makes any difference (and personally I'm expecting it won't since statically sized Eigen matrices are allocated on the stack)

That is true, but I could imagine that it might make a difference due to the `projected_normals_` vector.

> The only remaining question now is whether the parallelized PrincipalCurvaturesEstimation should have default thread number of 0 or 1. I'm leaning towards leaving it at 1 to maintain API stability and people can manually change that if they want parallelization.
I am fine with either, since the output is the same, and I would not consider parallelization a breaking change. But if you want 1 to be the default, that is also okay.
I've migrated the changes to the base class. Benchmarking showed about a 1% slowdown using local variables compared to storing them as class members, but that seems like a fair trade for the option of parallelization. I kept the default thread number at 1.
The above are only suggestions, as we try to move towards https://clang.llvm.org/extra/clang-tidy/checks/modernize/use-default-member-init.html