Unbalanced kd-tree
Issue Details
When constructing the kd-tree for a nearest neighbor search on this dataset, we get a stack overflow because the kd-tree becomes highly unbalanced.
Number of items stored: 3182745
Number of nodes: 1475173
Tree depth: 28458
These are the statistics for the kd-tree. The depth shows how unbalanced it is.
The kd-tree offers several split functions, but the default one is the best.
If the tree were balanced, it would have a depth of ~20 (log2(3182745) ≈ 21.6, a bit less with the bucket size taken into account).
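For reference, a minimal sketch of how such a tree is built and its statistics printed with CGAL's spatial searching API, assuming 3D points; the file name and reading loop are placeholders, not the actual test program (that one is linked under Source Code below):

```cpp
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Search_traits_3.h>
#include <CGAL/Orthogonal_k_neighbor_search.h>
#include <fstream>
#include <iostream>
#include <vector>

typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::Point_3                                      Point_3;
typedef CGAL::Search_traits_3<Kernel>                        Traits;
typedef CGAL::Orthogonal_k_neighbor_search<Traits>           Neighbor_search;
// Tree uses the default splitter, CGAL::Sliding_midpoint<Traits>.
typedef Neighbor_search::Tree                                Tree;

int main(int argc, char** argv) {
  std::ifstream in(argc > 1 ? argv[1] : "points.xyz");  // placeholder file
  std::vector<Point_3> points;
  Point_3 p;
  while (in >> p) points.push_back(p);

  Tree tree(points.begin(), points.end());
  tree.build();                // construction happens here; on the reported
                               // data set this recursion overflows the stack
  tree.statistics(std::cout);  // prints items stored / nodes / tree depth
  return 0;
}
```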
I just found a publication entitled "Building a Balanced k-d Tree in O(k n log n) Time", which might be worth looking at.
Source Code
I've put a test program at https://gist.github.com/afabri/6f524e56fa32d8e6ad7b749ee8119c10.
Environment
- CGAL version: master
CC @afabri and @lrineau. I have carefully tested this problem. A few remarks:
- The cited publication is for integer coordinates only (as far as I understood), and the author uses the `long` type in the C++ implementation attached to the paper. The integers are used to create so-called super keys, which are used for balancing the tree. CGAL's kd-tree supports floating-point data, so if we decide to implement this paper, we need to find a way to convert `double` to `long` while building the tree, or adapt the algorithm to support doubles. One possible conversion is sketched below.
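A standard, assumed approach (not taken from the paper) is the order-preserving bit trick used for radix-sorting floats: map each `double` to a 64-bit integer whose unsigned order matches the order of the original doubles, then build the super keys from those integers. A minimal sketch:

```cpp
#include <cstdint>
#include <cstring>

// Maps an IEEE-754 double to a 64-bit key such that unsigned comparison of
// the keys matches operator< on the original doubles (assuming no NaNs;
// -0.0 and +0.0 map to adjacent keys rather than the same key).
std::uint64_t to_ordered_key(double d) {
  std::uint64_t u;
  std::memcpy(&u, &d, sizeof u);       // grab the raw bit pattern
  if (u & 0x8000000000000000ULL)
    return ~u;                         // negative: invert all bits
  return u | 0x8000000000000000ULL;    // non-negative: set the sign bit
}
```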
- The provided data file indeed leads to a stack overflow when building the tree. In fact, we do not need such a complicated file to reproduce this behavior; we can simplify it. See the figures below.
*Figures, left to right: INPUT, ZOOM, FAILURE.*
Left: the only part that leads to the highly unbalanced tree is the one shown in green and red.
Middle: the zoom shows a blue part, which is very sparse, then a dense green part, and a red corner with no neighbors (just two curves). The whole zoomed data set has 42409 points, the green part has 23647 points, and the red part has 1796 points. To reproduce the failure, it is enough to run the test on the red part alone.
Right: if we look at the red part, we actually find three clusters (in orange) with duplicate points. Removing them, we end up with only 294 points and a tree depth of 28.
Red part stats before removing duplicates:
* num points: 1796
* bucket size: 10
Tree statistics:
Number of items stored: 1796
Number of nodes: 1564
Tree depth: 1524
Red part stats after removing duplicates:
* num points: 294
* bucket size: 10
* building the tree ... ... finished
Tree statistics:
Number of items stored: 294
Number of nodes: 65
Tree depth: 28
It is still pretty unbalanced, but that is most likely due to the many collinear points which lie along the two lines. A small reproduction sketch for the duplicate-point case follows.
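On my understanding of the default splitter, exact duplicates can never be separated into different cells, so heavily duplicated points force very deep trees, which matches the red-part statistics above. A minimal sketch with made-up coordinates that should mimic the orange clusters:

```cpp
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Search_traits_3.h>
#include <CGAL/Kd_tree.h>
#include <iostream>
#include <vector>

typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::Point_3                                      Point_3;
typedef CGAL::Search_traits_3<Kernel>                        Traits;

int main() {
  std::vector<Point_3> points;
  // One heavily duplicated point, like the orange clusters above.
  for (int i = 0; i < 2000; ++i) points.push_back(Point_3(0, 0, 0));
  // A few distinct points so the tree has something to split on.
  for (int i = 1; i <= 10; ++i) points.push_back(Point_3(i, i, i));

  CGAL::Kd_tree<Traits> tree(points.begin(), points.end());
  tree.build();
  tree.statistics(std::cout);  // expect a depth far above log2(n/bucket_size)
  return 0;
}
```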
- What we did for the red points, we can also do for the whole data set. I attach this gist with a small, simple function that removes nearly equal points from the point set. The original point set has 3182745 points, and after cleaning we have only 783376 points. The final stats are:
* num points before filtering: 3182745
* num points after filtering: 783376
* bucket size: 10
Tree statistics:
Number of items stored: 783376
Number of nodes: 164430
Tree depth: 49
The depth is still a bit higher than 20, but here the tree is pretty balanced. The remaining depth most likely comes from the other problematic parts of the point cloud, such as multiple nearly collinear points with no neighbors around. This particular edge case is actually mentioned in the user manual here. For such cases, the tree can be quite unbalanced. A possible filtering sketch is shown below.
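The gist uses its own small function; as an assumed stand-in, CGAL's grid-based simplification from the Point Set Processing package serves the same purpose of removing duplicates and near-duplicates:

```cpp
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/grid_simplify_point_set.h>
#include <vector>

typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::Point_3                                      Point_3;

// Keeps one representative point per epsilon-sized grid cell and erases the
// rest, removing duplicates and near-duplicates in a single pass.
void remove_near_equal(std::vector<Point_3>& points, double epsilon) {
  points.erase(CGAL::grid_simplify_point_set(points, epsilon), points.end());
}
```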
- Finally, the input mesh is quite degenerate: it has multiple zero-area faces and duplicate vertices.
I also attach the original and clean point clouds, plus a simple red-points cut that fails, here.
I will continue exploring the problem and see if we can find a good splitter that is not affected by highly dense clusters of points and maybe behaves better for edge cases with collinear points. A sketch of how an alternative splitter is plugged in follows.
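For reference, a sketch (reusing the typedefs from the first snippet) of how a different splitter is plugged into the tree. `Median_of_max_spread` is one candidate, since median splits keep both sides equal in size and thus bound the depth, though they cannot separate exact duplicates either:

```cpp
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Search_traits_3.h>
#include <CGAL/Splitters.h>
#include <CGAL/Kd_tree.h>

typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef CGAL::Search_traits_3<Kernel>                       Traits;

// Median-based splitter: each internal node splits its points into two
// halves of equal size, so the tree depth stays O(log n).
typedef CGAL::Median_of_max_spread<Traits>                  Splitter;
typedef CGAL::Kd_tree<Traits, Splitter>                     Median_tree;
```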