ArborX Improve traversal algorithm when interested only in pairs as a result

Just recording some thoughts.

The halo finding algorithm using union-find that we have right now has the callback that proceeds only if i < j so as to process any edge only once. The thing to notice here is this may as well be replaced by permuted(i) < permuted(j), where permuted(i) is the application of Morton permutation.

The implication of this is that we could skip traversal of the whole subtrees. If we use Apetrei ordering, for example, then we know that all nodes in the left subtree have index less or equal to that the index of the internal node. Thus, if the index of the internal node is lower than that of the query (assuming that queries are permuted using the same permutation as leaf nodes), we don't need to traverse that subtree. In a similar fashion, it should also be possible for Karras ordering.

Potential use cases for pairs:

2-points correlation algorithm
Half-neighbor lists in MD
Inline force computation in MD

It may also be worthwhile to examine papers on collision problems in computer graphics, as they share some of the same features.

Oct 06 '20 00:10 aprokop

When using ropes, the solution may be pretty elegant by starting not from the root, but from an internal node or leaf node with a specific index.

For Apetrei's ordering, it would be the right child of the internal node with index queryIndex.

For Karras ordering, it is more tricky, and depends whether the Karras internal node with queryIndex index is associated with left or right boundary of its range. For the left boundary, start with this internal node. For the right boundary, need start with the right child of the parent, i.e., follow the rope. Unfortunately, the info required to determine whether it is a left or right boundary is unavailable during the traversal.

Oct 06 '20 01:10 aprokop

Out of curiosity and to see whether this should be explored further, I hacked ropes spatial traversal and halo finding callback (the diffs are to master 5222818, the branch is here). The hack still starts from the root, but bypasses subtrees with index smaller than the query index.

--- a/src/details/ArborX_DetailsTreeTraversal.hpp
+++ b/src/details/ArborX_DetailsTreeTraversal.hpp
@@ -139,6 +139,8 @@ struct TreeTraversal<BVH, Predicates, Callback, SpatialPredicateTag>
   {
     auto const &predicate = Access::get(predicates_, queryIndex);
 
+    int leaf_nodes_shift = bvh_.size() - 1;
+
     Node const *node;
     int next = 0; // start with root
     do
@@ -149,10 +151,25 @@ struct TreeTraversal<BVH, Predicates, Callback, SpatialPredicateTag>
       {
         if (!node->isLeaf())
         {
-          next = node->left_child;
+          int left_child_index = node->left_child;
+          if (left_child_index > queryIndex)
+          {
+            // Only follow the left child if it is internal node with
+            // index >= queryIndex, or if it's a leaf node
+            next = left_child_index;
+          }
+          else
+          {
+            // Only follow the right child if it is internal node with
+            // index >= queryIndex, or if it's a leaf node
+            auto *left_child = bvh_.getNodePtr(left_child_index);
+            int right_child_index = left_child->rope;
+            next = right_child_index;
+          }
         }
         else
         {
+          if (next - leaf_nodes_shift > queryIndex)
             callback_(predicate, node->getLeafPermutationIndex());
           next = node->rope;
         }
--- a/examples/halo_finder/ArborX_HaloFinder.hpp
+++ b/examples/halo_finder/ArborX_HaloFinder.hpp
@@ -246,13 +246,6 @@ struct CCSCallback
   {
     int const i = ArborX::getData(query);
 
-    // only process edge in one direction
-    if (i > j)
-    {
-      // initialize to the first neighbor that's smaller
-      if (Kokkos::atomic_compare_exchange(&stat_(i), i, j) == i)
-        return;
-
     {
       // Per [1]:
       //
@@ -302,7 +295,6 @@ struct CCSCallback
       } while (repeat);
     }
   }
-  }
 };

This is hack because it a) assumes that queries are primitive points with a fixed radius, b) it only works for rope traversal, and c) the verify procedure now fails (but the results are correct, checked with dumping halo sizes and centers).

I got the following results on V100:

original version

total time      :      0.866
-> construction :      0.097
-> query+ccs    :      0.721
-> halos        :      0.048

pair hack

total time      :      0.502
-> construction :      0.087
-> query+ccs    :      0.368
-> halos        :      0.048

So, the traversal is 2x faster (1.7x overall). So, something that definitely should be pursued further.

P.S. I also tried to apply the occupancy hack on top of this, and got these results:

total time      :      0.435
-> construction :      0.087
-> query+ccs    :      0.300
-> halos        :      0.048

So it's about another 1.2x in traversal. At this point, query is just 3.5x slower than construction.

P.P.S. I am really interested to see if the general DBSCAN algorithm could also take advantage of this.

Oct 06 '20 17:10 aprokop

It just occurred to me that in the case we are interested (i.e., a cloud of points, and queries are spheres with a fixed radius), this could easily be done by using half-sphere primitive, i.e. a primitive of sphere intersected with half-space. I implemented this in a dbscan_halfsphere branch and tested on V100.

master 45172fc

total time          :      0.819                                                                                                                                                                                                                                                                                                                                                         
-- construction     :      0.098                                                                                                                                                                                                                                                                                                                                                         
-- query+cluster    :      0.700                                                                                                                                                                                                                                                                                                                                                         
-- postprocess      :      0.021

halo finding with half-sphere

total time          :      0.574                                                                                                                                                                                                                                                                                                                                                         
-- construction     :      0.097                                                                                                                                                                                                                                                                                                                                                         
-- query+cluster    :      0.455                                                                                                                                                                                                                                                                                                                                                         
-- postprocess      :      0.021

Certainly an improvement. But surprisingly, slower than the pair hack above. Maybe it's because we still have i == j situation, i.e. the point that the query is built on. The hack above allows us to avoid that situation, including extra traversal to descend to that node.

Feb 07 '21 05:02 aprokop