Add DistributedTree query only taking a callback
This adds a
template <typename ExecutionSpace, typename Predicates, typename Callback>
void query(ExecutionSpace const& space,
Predicates const& predicates,
Callback const& callback) const;
overload for DistributedTree (and `SpatialTag) with the callback being executed on the MPI process storing the primitives.
@masterleinad Could you please provide the use cases where this would be useful? I'm struggling to find a situation in the distributed setting where we would want to do the pure callback without communicating anything back as the results.
@masterleinad Could you please proved the use cases where this would be useful? I'm struggling to find a situation in the distributed setting where we would want to do the pure callback without communicating anything back as the results.
Something like counting for each box the number of rays that intersect it. I was mainly thinking about symmetry of the interface with the non-distributed version and could also see a version that would communicate back and invoke the callback on the MPI rank owning the queries (without storing results) for, e.g, counting for each ray the number of boxes it intersects,
Something like counting for each box the number of rays that intersect it.
This kind of semantically inverts the problem to "for each primitive find all queries that intersect it". Having said that, I can see how we can decide what primitives are based on their geometric properties (building a hierarchy on rays is not feasible).
I was mainly thinking about symmetry of the interface with the non-distributed version
I don't think they have to implement the exactly same interface. Sure, the functions they do implement should strive to have the same signature, but it's not necessarily the same set.
and could also see a version that would communicate back and invoke the callback on the MPI rank owning the queries (without storing results) for, e.g, counting for each ray the number of boxes it intersects
Counting the numbers of neighbors is certainly an interesting task that I am not sure how to do in the distributed setting at this point.
Pure spatial callback in the distributed setting is like a broadcast of information to the closest entities. And pure nearest callback is strange due to calling it twice (potentially with side effects).
I would propose that we are conservative in enabling the new calls, only introducing them when a concrete use case or need arises. Thus, I would suggest we don't do the pure callback at the moment.
I would find this interface to be very useful for implementing the Sierra Toolkit search interface. For the serial BVH, changing from the CSR interface to the callback interface reduced the runtime for one of our use cases by 40%, and high water mark memory usage by 38%.
With the CSR interface, we had to run a second kernal after calling ArborX to reformat the data to fit our API. With the callback, we could put the data into the desired format directly.
@JaredCrean2 Can you please describe your use case in more details? There are some subtleties with pure distributed callbacks, such as all operations for a query happening only on the rank that possesses the result. I want to make sure that it actually does what you need it to.
I would like a callback
operator()( /*type returned by AccessTraits<ViewPassedToConstructor>::get() */ primitive,
/* value_type for View that gets passed into query function */ predicate,
int predicate_rank);
that gets called once per match on the process that owns the primitive. The process that owns the predicate would be indicated by the predicate_rank argument. In this API, the conversion of the user predicate into an ArborX predicate (via AccessTraits would have to happen on the process that owns the primitive (ie. the user type would be sent over MPI). This is important because STK supports types that need additional intersection checking (see below).
The predicate_rank argument could be removed (because the MPI rank could be included in the predicate object), but I think it is helpful clarifying the API and making it easier for other users who might not already include that information in their predicate.
The API for STK is (roughly):
Inputs:
-
Kokkos::View<BoxIdentProc*> localDomain -
Kokkos::View<BoxIdentProc*> localRange
where BoxIdentProc is a little struct that contains a shape of some kind (bounding box, sphere, point, etc.) and a struct identifying the shape (local id + MPI rank).
Output:
-
Kokkos::View<IdentIntersection*> results
where IdentIntersection is a a little struct that contains the IdentProc part of the domain and range for each match. The domain part is always owned by the local process. The range part may be owned by a remote process. We can optionally symmetrize the results (so the intersection is returned on the process that owns the range object too).
STK supports shapes for which an axis-aligned bounding box is not exact, for example spheres. So, after ArborX returns a match, we may have to do an additional intersection check to see if the underlying shapes really intersect, and only record the match in results if true. To do this, we will need ArborX to pass in the underlying STK data that was passed into the query() function, rather than the ArborX predicate it is converted to via AccessTraits::get().
If you agree with this, it might make sense to change the first argument to the callback to be the user data passed into the constructor as well. Then the API (constructor, query, callback) will be entirely in terms of user data types, and the conversion to ArborX data types via AccessTraits will be completely internal to ArborX.
BTW, the AccessTraits made implementing the callback + conversion of STK shapes to ArborX object very easy. Great design choice.
I think I understand most of what you are saying. The main thing I don't understand is the output. Pure callbacks would not produce any, unless you preallocate some storage and make it a part of the callback.
To go into more details.
I would like a callback
operator()( /*type returned by AccessTraits<ViewPassedToConstructor>::get() */ primitive, /* value_type for View that gets passed into query function */ predicate, int predicate_rank);that gets called once per match on the process that owns the primitive. The process that owns the predicate would be indicated by the
predicate_rankargument. In this API, the conversion of the user predicate into an ArborX predicate (viaAccessTraitswould have to happen on the process that owns the primitive (ie. the user type would be sent over MPI). This is important because STK supports types that need additional intersection checking (see below).
So, that's doable and flexible. You can attach whatever data is necessary to the predicate on the process that you launch it on, including the MPI rank. You also have two options. Option 1 is to make the predicate based on your custom geometry. This way, you would have to (potentially) write your own intersection tests with geometries, but would not need to attach the fine predicate. The main drawback of this approach is likely the cost of all the intersections. Option 2 is to create a simplified geometry (coarse) and attach the fine geometry as data. Then in a callback to do the fine intersects to remove false positives. Whether option 1 or 2 is faster depends on the specific problem.
While our old API only allowed the results of Access::get to be a point or a box, the updated API allows it to be any type. We changed the mechanism to follow Boost.Geometry, including indexable getter. This is an experimental for now, and unfortunately, lacking documentation. I can explain later how to use it.
Output:
* `Kokkos::View<IdentIntersection*> results`where
IdentIntersectionis a a little struct that contains theIdentProcpart of thedomainandrangefor each match. Thedomainpart is always owned by the local process. Therangepart may be owned by a remote process. We can optionally symmetrize the results (so the intersection is returned on the process that owns the range object too).
This is the part I'm most confused about. I'm not sure what results would store. For a pure callback, this will have to be preallocated on the primitives ranks. This essentially requires knowing the number of results in advance. For example, that there is at most one intersection with a primitive, or something like that, if you store it. For other things, for example, computing the minimum distance, or something that can be done with atomics, this can be relaxed.
STK supports shapes for which an axis-aligned bounding box is not exact, for example spheres. So, after ArborX returns a match, we may have to do an additional intersection check to see if the underlying shapes really intersect, and only record the match in
resultsif true. To do this, we will need ArborX to pass in the underlying STK data that was passed into thequery()function, rather than the ArborX predicate it is converted to viaAccessTraits::get().
That's fine. As I mentioned, we now support both coarse and fine intersections, and you can do it either way.
If you agree with this, it might make sense to change the first argument to the callback to be the user data passed into the constructor as well. Then the API (constructor, query, callback) will be entirely in terms of user data types, and the conversion to ArborX data types via
AccessTraitswill be completely internal to ArborX.
If I understand it correctly, this is already all present in ArborX. You can construct a hierarchy based on your data types, which will be stored in leaves and used in the callback for intersection. You can/should also specify an indexable getter, to convert it to a bounding volume, or provide expand functions to construct the hierarchy based on custom bounding volumes.
Option 2 is to create a simplified geometry (coarse) and attach the fine geometry as data. Then in a callback to do the fine intersects to remove false positives
Yep, this is what I was thinking.
You can attach whatever data is necessary to the predicate on the process that you launch it on, including the MPI rank.
Interesting, I see the PredicateWithAttachment allows attaching arbitrary data to the predicate. This should work. Now that I know this, I think the callback should be the same signature as the BVH callback: operator()( /* type returned by AccessTraits::get() for the predicate */ predicate, int primitive_idx). If it is guaranteed that the primitive is owned by the local process, that should be enough to retrieve the underlying shape and compute the intersection.
I'm not sure what
resultswould store. For a pure callback, this will have to be preallocated on the primitives ranks. This essentially requires knowing the number of results in advance.
We typically preallocate if we have a reasonable guess for how many results there will be. If we run out of space, we just count the number of results and run the query a second time.
If I understand it correctly, this is already all present in ArborX.
It looks like all the pieces are there, just need the DistributedTree to support the callback, so we can avoid running a second kernel afterwards to reformat the data.
Couple builds (one CUDA, one SYCL) fail for unrelated reasons.
@JaredCrean2 Please give it a try to see if you can make things work in STK.