
Unbalanced kdtree

Open afabri opened this issue 3 years ago • 2 comments

Issue Details

When constructing the kd-tree for a nearest neighbor search on this dataset, we get a stack overflow, as the kd-tree becomes highly unbalanced.

Number of items stored: 3182745
Number of nodes: 1475173
 Tree depth: 28458
Those are the statistics for the kd-tree; the depth shows how unbalanced it is. The kd-tree offers several split functions, but the default is the best.

If it were balanced, it would have a depth of ~20 (roughly log2 of the number of points).
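For reference, the statistics above come straight from the Kd_tree API; a minimal sketch of reproducing them, assuming the points sit in an .xyz file (the file name is a placeholder; the actual test program is in the gist linked below):

```cpp
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Search_traits_3.h>
#include <CGAL/Kd_tree.h>
#include <fstream>
#include <iostream>
#include <vector>

typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::Point_3                                      Point_3;
typedef CGAL::Search_traits_3<Kernel>                        Traits;
typedef CGAL::Kd_tree<Traits>                                Tree;

int main() {
  std::vector<Point_3> points;
  std::ifstream in("points.xyz"); // placeholder file name
  Point_3 p;
  while (in >> p) points.push_back(p);

  Tree tree(points.begin(), points.end());
  tree.build();                // construction is lazy; on this dataset,
                               // the recursion here overflows the stack
  tree.statistics(std::cout);  // items stored, nodes, tree depth
  return 0;
}
```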

I just found the publication Building a Balanced k-d Tree in O(k n log n) Time, which might be worth looking at.

Source Code

I've put a test program at https://gist.github.com/afabri/6f524e56fa32d8e6ad7b749ee8119c10.

Environment

  • CGAL version: master

afabri avatar Jun 29 '21 11:06 afabri

CC @afabri and @lrineau. I have tested this problem carefully. A few remarks:

  1. The cited publication is for integer coordinates only (as far as I understood), and the author uses the long type in the C++ implementation attached to the paper. The integers are used to create so-called super keys, which are used for balancing the tree. CGAL's kd-tree supports floating-point data, so if we decide to implement this paper, we either need a way to convert double to long while building the tree (one possible mapping is sketched below) or have to adapt the algorithm to support doubles.
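For what it's worth, one well-known bit trick gives an order-preserving double-to-integer map, assuming IEEE 754 doubles; the function name is mine, and this is only a sketch of a possible direction, not part of CGAL or the paper:

```cpp
#include <cstdint>
#include <cstring>

// Order-preserving map: for finite a and b, a < b <=> to_key(a) < to_key(b).
// Positive doubles already compare like their bit patterns; negative ones
// compare in reverse, so we flip their magnitude bits. NaNs are not handled.
std::int64_t to_key(double d) {
  std::int64_t bits;
  std::memcpy(&bits, &d, sizeof bits);  // well-defined type punning
  return (bits < 0) ? (bits ^ 0x7FFFFFFFFFFFFFFFLL) : bits;
}
```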

  2. The provided data file indeed leads to a stack overflow when building the tree. In fact, we do not need such a complicated file to reproduce this behavior; we can simplify. See the figures below.

[Figures: INPUT (left), ZOOM (middle), FAILURE (right); attachments part1, part2, part3]

Left: the only part that leads to the highly unbalanced tree is the one in green and red.

Middle: the zoom shows that we have a blue part, which is very sparse, then a dense green part, and a red corner without neighbors (just two curves). The whole zoomed data set has 42409 points; the green part has 23647 points and the red part 1796 points. To make the test fail, it is enough to run it on the red part only.

Right: if we look at the red part, we actually find three clusters (in orange) of duplicate points. Removing them, we end up with only 294 points and a tree depth of 28.

Red part stats before removing duplicates:

* num points: 1796
* bucket size: 10
Tree statistics:
Number of items stored: 1796
Number of nodes: 1564
 Tree depth: 1524

Red part stats after removing duplicates:

* num points: 294
* bucket size: 10
* building the tree ... ... finished 
Tree statistics:
Number of items stored: 294
Number of nodes: 65
 Tree depth: 28
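For the record, removing the exact duplicates needs nothing CGAL-specific, since Point_3 compares lexicographically; a minimal sketch (the function name is mine):

```cpp
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <algorithm>
#include <vector>

typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::Point_3 Point_3;

// Exact duplicates only: Point_3 orders lexicographically by (x, y, z),
// so a sort-and-unique pass removes repeated points.
void remove_exact_duplicates(std::vector<Point_3>& points) {
  std::sort(points.begin(), points.end());
  points.erase(std::unique(points.begin(), points.end()), points.end());
}
```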

It is still pretty unbalanced, but most likely due to multiple collinear points which lie along the two lines.

  3. What we did for the red points, we can also do for the whole data set. I attach this gist with a small, simple function that removes near-equal points from the point set. The original point set has 3182745 points; after cleaning, we have only 783376 points. The final stats are:
* num points before filtering: 3182745
* num points after filtering: 783376
* bucket size: 10
Tree statistics:
Number of items stored: 783376
Number of nodes: 164430
 Tree depth: 49

The depth is still a bit higher than 20, but here the tree is pretty balanced. The remaining depth most likely comes from the other problematic parts of the point cloud, such as multiple near-collinear points with no neighbors around. This particular edge case is actually mentioned in the user manual here. For such cases, the tree can be quite unbalanced.
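The gist's filtering function is not reproduced here, but CGAL itself ships a utility in this spirit; a minimal sketch using grid_simplify_point_set from Point Set Processing, which keeps one representative point per epsilon-sized grid cell (not necessarily what the gist does):

```cpp
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/grid_simplify_point_set.h>
#include <vector>

typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::Point_3 Point_3;

// Remove near-equal points: everything closer than `epsilon` to a kept
// point collapses to one representative. `epsilon` is a placeholder
// tolerance to be chosen from the data.
void remove_near_equal(std::vector<Point_3>& points, double epsilon) {
  points.erase(CGAL::grid_simplify_point_set(points, epsilon),
               points.end());
}
```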

  4. Finally, the input mesh is quite degenerate: it has multiple zero-area faces and duplicate vertices.

I also attach the original and cleaned point clouds, plus a simple red-points cut that fails, here.

danston avatar Jul 09 '21 14:07 danston

I will continue exploring the problem and see if we can find a good splitter that is not affected by highly dense clusters of points and maybe behaves better in edge cases with collinear points; a minimal sketch of swapping the splitter follows.
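For reference, a different splitter can already be plugged in through the Kd_tree template parameters; a sketch using the existing Median_of_max_spread (whether any of the shipped splitters actually helps here is what remains to be tested):

```cpp
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Search_traits_3.h>
#include <CGAL/Splitters.h>
#include <CGAL/Kd_tree.h>

typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef CGAL::Search_traits_3<Kernel> Traits;
// The default splitter is Sliding_midpoint. Median-based splitters cut
// at the median, which keeps the tree shallow on uneven densities,
// though exact duplicates can still defeat any coordinate split.
typedef CGAL::Median_of_max_spread<Traits> Splitter;
typedef CGAL::Kd_tree<Traits, Splitter> Tree;
```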

danston avatar Jul 09 '21 14:07 danston