ArborX Add DistributedTree query only taking a callback

This adds a

template <typename ExecutionSpace, typename Predicates, typename Callback>
void query(ExecutionSpace const& space,
           Predicates const& predicates,
           Callback const& callback) const;

overload for DistributedTree (and `SpatialTag) with the callback being executed on the MPI process storing the primitives.

Sep 01 '22 18:09 masterleinad

@masterleinad Could you please provide the use cases where this would be useful? I'm struggling to find a situation in the distributed setting where we would want to do the pure callback without communicating anything back as the results.

Sep 01 '22 19:09 aprokop

@masterleinad Could you please proved the use cases where this would be useful? I'm struggling to find a situation in the distributed setting where we would want to do the pure callback without communicating anything back as the results.

Something like counting for each box the number of rays that intersect it. I was mainly thinking about symmetry of the interface with the non-distributed version and could also see a version that would communicate back and invoke the callback on the MPI rank owning the queries (without storing results) for, e.g, counting for each ray the number of boxes it intersects,

Sep 01 '22 19:09 masterleinad

Something like counting for each box the number of rays that intersect it.

This kind of semantically inverts the problem to "for each primitive find all queries that intersect it". Having said that, I can see how we can decide what primitives are based on their geometric properties (building a hierarchy on rays is not feasible).

I was mainly thinking about symmetry of the interface with the non-distributed version

I don't think they have to implement the exactly same interface. Sure, the functions they do implement should strive to have the same signature, but it's not necessarily the same set.

and could also see a version that would communicate back and invoke the callback on the MPI rank owning the queries (without storing results) for, e.g, counting for each ray the number of boxes it intersects

Counting the numbers of neighbors is certainly an interesting task that I am not sure how to do in the distributed setting at this point.

Pure spatial callback in the distributed setting is like a broadcast of information to the closest entities. And pure nearest callback is strange due to calling it twice (potentially with side effects).

I would propose that we are conservative in enabling the new calls, only introducing them when a concrete use case or need arises. Thus, I would suggest we don't do the pure callback at the moment.

Sep 01 '22 19:09 aprokop

I would find this interface to be very useful for implementing the Sierra Toolkit search interface. For the serial BVH, changing from the CSR interface to the callback interface reduced the runtime for one of our use cases by 40%, and high water mark memory usage by 38%.

With the CSR interface, we had to run a second kernal after calling ArborX to reformat the data to fit our API. With the callback, we could put the data into the desired format directly.

Jun 25 '24 21:06 JaredCrean2

@JaredCrean2 Can you please describe your use case in more details? There are some subtleties with pure distributed callbacks, such as all operations for a query happening only on the rank that possesses the result. I want to make sure that it actually does what you need it to.

Jun 25 '24 21:06 aprokop

I would like a callback

operator()( /*type returned by AccessTraits<ViewPassedToConstructor>::get() */ primitive,
                    /* value_type for View that gets passed into query function */ predicate,
                    int predicate_rank);

that gets called once per match on the process that owns the primitive. The process that owns the predicate would be indicated by the predicate_rank argument. In this API, the conversion of the user predicate into an ArborX predicate (via AccessTraits would have to happen on the process that owns the primitive (ie. the user type would be sent over MPI). This is important because STK supports types that need additional intersection checking (see below).

The predicate_rank argument could be removed (because the MPI rank could be included in the predicate object), but I think it is helpful clarifying the API and making it easier for other users who might not already include that information in their predicate.

The API for STK is (roughly):

Inputs:

Kokkos::View<BoxIdentProc*> localDomain
Kokkos::View<BoxIdentProc*> localRange

where BoxIdentProc is a little struct that contains a shape of some kind (bounding box, sphere, point, etc.) and a struct identifying the shape (local id + MPI rank).

Output:

Kokkos::View<IdentIntersection*> results

where IdentIntersection is a a little struct that contains the IdentProc part of the domain and range for each match. The domain part is always owned by the local process. The range part may be owned by a remote process. We can optionally symmetrize the results (so the intersection is returned on the process that owns the range object too).

STK supports shapes for which an axis-aligned bounding box is not exact, for example spheres. So, after ArborX returns a match, we may have to do an additional intersection check to see if the underlying shapes really intersect, and only record the match in results if true. To do this, we will need ArborX to pass in the underlying STK data that was passed into the query() function, rather than the ArborX predicate it is converted to via AccessTraits::get().

If you agree with this, it might make sense to change the first argument to the callback to be the user data passed into the constructor as well. Then the API (constructor, query, callback) will be entirely in terms of user data types, and the conversion to ArborX data types via AccessTraits will be completely internal to ArborX.

BTW, the AccessTraits made implementing the callback + conversion of STK shapes to ArborX object very easy. Great design choice.

Jun 25 '24 22:06 JaredCrean2

I think I understand most of what you are saying. The main thing I don't understand is the output. Pure callbacks would not produce any, unless you preallocate some storage and make it a part of the callback.

To go into more details.

I would like a callback
operator()( /*type returned by AccessTraits<ViewPassedToConstructor>::get() */ primitive,
                    /* value_type for View that gets passed into query function */ predicate,
                    int predicate_rank);          
that gets called once per match on the process that owns the primitive. The process that owns the predicate would be indicated by the predicate_rank argument. In this API, the conversion of the user predicate into an ArborX predicate (via AccessTraits would have to happen on the process that owns the primitive (ie. the user type would be sent over MPI). This is important because STK supports types that need additional intersection checking (see below).

So, that's doable and flexible. You can attach whatever data is necessary to the predicate on the process that you launch it on, including the MPI rank. You also have two options. Option 1 is to make the predicate based on your custom geometry. This way, you would have to (potentially) write your own intersection tests with geometries, but would not need to attach the fine predicate. The main drawback of this approach is likely the cost of all the intersections. Option 2 is to create a simplified geometry (coarse) and attach the fine geometry as data. Then in a callback to do the fine intersects to remove false positives. Whether option 1 or 2 is faster depends on the specific problem.

While our old API only allowed the results of Access::get to be a point or a box, the updated API allows it to be any type. We changed the mechanism to follow Boost.Geometry, including indexable getter. This is an experimental for now, and unfortunately, lacking documentation. I can explain later how to use it.

Output:
* `Kokkos::View<IdentIntersection*> results`
where IdentIntersection is a a little struct that contains the IdentProc part of the domain and range for each match. The domain part is always owned by the local process. The range part may be owned by a remote process. We can optionally symmetrize the results (so the intersection is returned on the process that owns the range object too).

This is the part I'm most confused about. I'm not sure what results would store. For a pure callback, this will have to be preallocated on the primitives ranks. This essentially requires knowing the number of results in advance. For example, that there is at most one intersection with a primitive, or something like that, if you store it. For other things, for example, computing the minimum distance, or something that can be done with atomics, this can be relaxed.

STK supports shapes for which an axis-aligned bounding box is not exact, for example spheres. So, after ArborX returns a match, we may have to do an additional intersection check to see if the underlying shapes really intersect, and only record the match in results if true. To do this, we will need ArborX to pass in the underlying STK data that was passed into the query() function, rather than the ArborX predicate it is converted to via AccessTraits::get().

That's fine. As I mentioned, we now support both coarse and fine intersections, and you can do it either way.

If you agree with this, it might make sense to change the first argument to the callback to be the user data passed into the constructor as well. Then the API (constructor, query, callback) will be entirely in terms of user data types, and the conversion to ArborX data types via AccessTraits will be completely internal to ArborX.

If I understand it correctly, this is already all present in ArborX. You can construct a hierarchy based on your data types, which will be stored in leaves and used in the callback for intersection. You can/should also specify an indexable getter, to convert it to a bounding volume, or provide expand functions to construct the hierarchy based on custom bounding volumes.

Jun 25 '24 23:06 aprokop

Option 2 is to create a simplified geometry (coarse) and attach the fine geometry as data. Then in a callback to do the fine intersects to remove false positives

Yep, this is what I was thinking.

You can attach whatever data is necessary to the predicate on the process that you launch it on, including the MPI rank.

Interesting, I see the PredicateWithAttachment allows attaching arbitrary data to the predicate. This should work. Now that I know this, I think the callback should be the same signature as the BVH callback: operator()( /* type returned by AccessTraits::get() for the predicate */ predicate, int primitive_idx). If it is guaranteed that the primitive is owned by the local process, that should be enough to retrieve the underlying shape and compute the intersection.

I'm not sure what results would store. For a pure callback, this will have to be preallocated on the primitives ranks. This essentially requires knowing the number of results in advance.

We typically preallocate if we have a reasonable guess for how many results there will be. If we run out of space, we just count the number of results and run the query a second time.

If I understand it correctly, this is already all present in ArborX.

It looks like all the pieces are there, just need the DistributedTree to support the callback, so we can avoid running a second kernel afterwards to reformat the data.

Jun 26 '24 17:06 JaredCrean2

Couple builds (one CUDA, one SYCL) fail for unrelated reasons.

Jul 01 '24 18:07 aprokop

@JaredCrean2 Please give it a try to see if you can make things work in STK.

Jul 01 '24 18:07 aprokop