
Parallel KD-Tree construction

Open B1ueber2y opened this issue 2 years ago • 12 comments

Hey, thanks for the great project! I am wondering if there is an easy way (or a forked project) to enable parallelism for KD-tree construction with nanoflann? Or do you have plans to integrate this feature in the future?

B1ueber2y avatar Feb 11 '22 23:02 B1ueber2y

What is the MatrixType & coeff?

AlphaHot avatar Apr 23 '22 14:04 AlphaHot

Hey, I forked the project and made a commit that builds nanoflann::KDTreeBaseClass concurrently. It works by adding an extra parameter to KDTreeSingleIndexAdaptorParams (1 by default). In my tests, concurrent builds with 8-16 threads can improve build performance by up to around 3x. See the docs about the parameter: https://github.com/Mintpasokon/nanoflann-concurrent-build/tree/concurrent_build#23-kdtreesingleindexadaptorparamsn_thread_build I was wondering if I could propose a PR for the project. Please let me know if you have any suggestions or expectations. @jlblancoc
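To illustrate the underlying idea (this is a toy 1-D sketch with standard-library calls only, not nanoflann's actual code): once a range has been partitioned around its median, the two halves are independent, so one can be handed to a fresh thread while the current thread builds the other, until the thread budget is used up.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Concurrent median-split, the core of a parallel kd-tree build.
// Splitting down to single-element leaves with nth_element sorts the
// array, which makes the result easy to check.
void splitBuild(std::vector<double>& v, std::size_t lo, std::size_t hi,
                unsigned threads)
{
    if (hi - lo <= 1) return;
    const std::size_t mid = lo + (hi - lo) / 2;
    // Partition the range around its median element.
    std::nth_element(v.begin() + lo, v.begin() + mid, v.begin() + hi);
    if (threads > 1)
    {
        // The halves touch disjoint elements, so this is race-free.
        std::thread left([&v, lo, mid, threads]
                         { splitBuild(v, lo, mid, threads / 2); });
        splitBuild(v, mid, hi, threads - threads / 2);
        left.join();
    }
    else
    {
        splitBuild(v, lo, mid, 1);
        splitBuild(v, mid, hi, 1);
    }
}
```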

Mintpasokon avatar Mar 29 '23 07:03 Mintpasokon

I totally love your changes and how you approached it :+1:

I checked it with valgrind and the gcc sanitizer, and both are happy, so am I. Well, actually valgrind --tool=helgrind tests/unit_tests complains about "potential" data races, but I think they are false positives due to the use of atomics. Take a look at it yourself first if you want (build with "-g" for debug symbols).

Then, yes, feel free to open a PR after adding the corresponding line (and your name at the end?) to the CHANGELOG.

Thanks!

jlblancoc avatar Mar 30 '23 21:03 jlblancoc

Thank you for your reply! I have submitted a PR with some explanation about helgrind. If you have any suggestions or concerns, please let me know. Thanks again!

Mintpasokon avatar Mar 31 '23 09:03 Mintpasokon

Closing: this feature has been merged and is released in the latest version, v1.5.0.

jlblancoc avatar Jun 15 '23 23:06 jlblancoc

@Mintpasokon Can you share some details on the dataset/parameters you used to achieve a 3x speedup? I tried it in my application and the results were far from that (including being slower at very large thread counts on an HPC node). I would love to understand whether there is room for improvement - sequential kd-tree construction has become a bottleneck in our application...

dokempf avatar Jun 22 '23 07:06 dokempf

@dokempf I tested on a Ryzen 7 5800X (8c/16t) with DDR4-3200, using 1-20 million 2D points, to get the 3x speedup. Since you are using a large number of threads, I suspect a problem with memory access locality, or false sharing, in this line: the indices in vAcc are in random order, so in the worst case the build touches a much larger cache footprint than the amount of data actually accessed. It would probably help to

  1. limit the thread count, maybe to 8 or 16;
  2. change nanoflann's behavior, e.g. make a copy of the original dataset and actually move the data in that copy, rather than the indices in vAcc, when building the kd-tree (in planeSplit()), which may help with locality. I have no access to my PC right now; I will try the second approach when available. Good luck!
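For what it's worth, the second suggestion amounts to a one-off gather pass over the points. This snippet is only an illustration; the names (Point2D, vAcc) mirror the discussion above, not nanoflann's actual API.

```cpp
#include <cstddef>
#include <vector>

struct Point2D { double x, y; };

// Physically reorder the points according to the index permutation
// (nanoflann keeps the data in place and permutes only the indices).
// After this gather, each subtree reads a contiguous block of memory.
std::vector<Point2D> reorderByIndex(const std::vector<Point2D>& pts,
                                    const std::vector<std::size_t>& vAcc)
{
    std::vector<Point2D> out;
    out.reserve(pts.size());
    for (std::size_t i : vAcc)
        out.push_back(pts[i]); // one random-access gather, then linear scans
    return out;
}
```

The trade-off is the extra copy of the dataset, which may matter for the very large point clouds discussed below.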

Mintpasokon avatar Jun 22 '23 14:06 Mintpasokon

@dokempf I ran some perf tests varying the number of points and the thread count in kd-tree building (see the chart attached to the original comment). It seems that concurrent builds have a performance advantage once there are at least 50,000 points in the dataset, and the speedup ratio grows linearly with the logarithm of the point count. What is the scale of your problem? It may be related.

Mintpasokon avatar Jun 25 '23 10:06 Mintpasokon

Thanks for the benchmarking, @Mintpasokon !! If you feel like it, it would be great if you opened a Pull Request to contribute your benchmark to the nanoflann benchmark repo... just in case we have time someday to put everything there in order and produce a decent report :-) In that case, feel free to add your name and/or affiliation to your files, of course.

jlblancoc avatar Jun 25 '23 22:06 jlblancoc

The scale of my problems is much larger: I tested datasets of 100k-400M double-precision 3D points. The smallest of these still fit in L3 cache, yet I only see speedups of up to 1.3x - if any.

dokempf avatar Jun 26 '23 08:06 dokempf

@dokempf It's probably worth investigating an alternative implementation for such large datasets with T threads:

  1. Split the input dataset into T clusters (to exploit data locality, just split it into T consecutive blocks)
  2. Run the "classic" single-thread method for each cluster.
  3. Merge the resulting tree nodes.

Overall, for n points and T threads, the expected average time complexity (if I'm right... just a quick draft):

original: O(n log n) 
parallel: O( n/T · log(n/T)  + T · n/T · log(n/T) ) = O(T · n/T · log(n/T)) = O( n log(n/T) )

Of course, this is just the expected average, and the constants (omitted above) are key in real implementations, but in theory we could obtain some gain.
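A rough 1-D sketch of steps 1-2, with sorting standing in for tree building and the merge deferred to query time rather than merging tree nodes (my own simplification for illustration, not an actual nanoflann API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Split the input into T consecutive blocks, "build" (sort) each block in
// its own thread, then answer a nearest-neighbor query by probing every
// block's sorted range and keeping the best hit. Assumes non-empty input.
double nearestForest(std::vector<double> data, unsigned T, double q)
{
    const std::size_t n = data.size();
    const std::size_t block = (n + T - 1) / T;

    // Steps 1-2: independent per-block builds, one thread per block.
    std::vector<std::thread> workers;
    for (std::size_t b = 0; b < n; b += block)
        workers.emplace_back([&data, b, e = std::min(b + block, n)]
                             { std::sort(data.begin() + b, data.begin() + e); });
    for (auto& w : workers) w.join();

    // Step 3, done per query: binary-search each block and merge results.
    double best = data.front();
    for (std::size_t b = 0; b < n; b += block)
    {
        const std::size_t e = std::min(b + block, n);
        auto it = std::lower_bound(data.begin() + b, data.begin() + e, q);
        if (it != data.begin() + e &&
            std::fabs(*it - q) < std::fabs(best - q)) best = *it;
        if (it != data.begin() + b &&
            std::fabs(*(it - 1) - q) < std::fabs(best - q)) best = *(it - 1);
    }
    return best;
}
```

Merging at query time multiplies each lookup by T, so this only pays off when builds dominate queries; merging the per-block trees into one index, as proposed above, avoids that per-query cost.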

PS: I would love to, but I don't have the bandwidth at present to try implementing and benchmarking it, though... :-(

jlblancoc avatar Jun 26 '23 10:06 jlblancoc

PS: @dokempf if your data is not so sparse that memory becomes a showstopper, you could try a grid-based representation instead, e.g. here I have an implementation

jlblancoc avatar Jun 26 '23 10:06 jlblancoc