Enable update of primitives in existing BVH
The data structure seems to be reconstructed from scratch every time the primitives change (whether in their property values or in their number).
It would be beneficial to offer a public update method that can reuse existing storage whenever the primitives change, especially when the number of primitives does not change but their location/radius/etc. does.
I think in general it is a good idea. The challenge is that there are few good update options. If we assume that the number of primitives does not change, a reasonable thing to do would be to recompute all bounding boxes while maintaining the same hierarchy structure. That would save on the initial sort of the primitives, but would incur the extra cost of computing the parent of each node (e.g., a findParents call). An additional cost is the slowdown in the query, as presumably the quality of the hierarchy would be worse than if it were constructed from scratch.
We can certainly see if such an approach would provide a real benefit to your application. Generally, it would be most useful when the new primitives moved very slightly from their previous positions (if centroids of the primitives have not changed at all, it would be ideal, as the hierarchy structure would be the same). In most applications we've seen, the hierarchy construction takes a small fraction of the query. I would be interested to know if that's not the case for you.
Note: we have an additional problem if we try to implement this: we do not store permutation indices. This means that, even if the primitives have not changed at all, we would still need to sort them to match them to the leaf nodes. Without eliminating that, there's almost no difference between updating a hierarchy and building a new one. To avoid it, we would need to store the permutation indices as part of the hierarchy. This slightly increases the memory required, though we could make it optional.
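To make the idea concrete, here is a rough Kokkos sketch of such a refit (not the actual implementation; all type and function names here are made up): recompute the leaf boxes from the moved primitives via stored permutation indices, then propagate them bottom-up, using per-node atomic counters so that a parent is only processed after both of its children.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical bounding box type; ArborX uses its own bounding volume types.
struct Box
{
  float min[3];
  float max[3];
};

KOKKOS_INLINE_FUNCTION
Box merge(Box const &a, Box const &b)
{
  Box r;
  for (int d = 0; d < 3; ++d)
  {
    r.min[d] = a.min[d] < b.min[d] ? a.min[d] : b.min[d];
    r.max[d] = a.max[d] > b.max[d] ? a.max[d] : b.max[d];
  }
  return r;
}

// Refit all bounding volumes while keeping the hierarchy topology.
// Layout assumption (Karras-style): internal nodes occupy [0, n-1),
// leaves occupy [n-1, 2n-1), node 0 is the root, root's parent is -1.
void refitBoundingVolumes(
    Kokkos::View<Box const *> primitive_boxes, // recomputed by the user
    Kokkos::View<int const *> permute,         // leaf i -> primitive permute(i)
    Kokkos::View<int const *> parents,         // e.g., from a findParents-style pass
    Kokkos::View<int const *> left_child,
    Kokkos::View<int const *> right_child,
    Kokkos::View<Box *> bounding_volumes,      // one per node (internal + leaf)
    int n)                                     // number of primitives/leaves
{
  Kokkos::View<int *> flags("refit_flags", n - 1); // zero-initialized

  Kokkos::parallel_for(
      "refit_leaves_and_propagate", Kokkos::RangePolicy<>(n - 1, 2 * n - 1),
      KOKKOS_LAMBDA(int i) {
        // Recompute the leaf box from the (possibly moved) primitive.
        bounding_volumes(i) = primitive_boxes(permute(i - (n - 1)));

        // Walk towards the root. atomic_fetch_add returns the previous value,
        // so only the second thread to reach a node proceeds, which guarantees
        // both children are finished before their boxes are merged.
        for (int node = parents(i);
             node >= 0 && Kokkos::atomic_fetch_add(&flags(node), 1) == 1;
             node = parents(node))
          bounding_volumes(node) = merge(bounding_volumes(left_child(node)),
                                         bounding_volumes(right_child(node)));
      });
}
```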
I did a quick and pretty dirty implementation of the update procedure that I was talking about (available here).
I ran the standard HACC problem on MI250X, first calling the standard constructor, then update with the same primitives.
37M: speedup 6.56 (2% of total DBSCAN time)
1.34e-01 sec 18.7% 90.9% 0.0% 0.1% 5.22e+01 1 ArborX::DBSCAN::tree_construction [region]
|-> 1.34e-01 sec 18.7% 91.0% 0.0% 0.0% 5.23e+01 1 ArborX::BVH::BVH [region]
|-> 1.05e-01 sec 14.6% 100.0% 0.0% 0.0% 9.52e+00 1 ArborX::BVH::BVH::calculate_scene_bounding_box [region]
|-> 1.57e-02 sec 2.2% 99.7% 0.0% 1.7% 2.54e+02 1 ArborX::BVH::BVH::generate_hierarchy [region]
|-> 1.19e-02 sec 1.7% 1.7% 0.0% 0.0% 8.41e+01 1 ArborX::BVH::BVH::sort_linearized_order [region]
2.04e-02 sec 2.8% 99.0% 0.0% 0.0% 2.45e+02 1 ArborX::DBSCAN::tree_update [region]
|-> 2.04e-02 sec 2.8% 99.1% 0.0% 2.0% 2.45e+02 1 ArborX::BVH::update [region]
|-> 1.71e-02 sec 2.4% 100.0% 0.0% ------ 1 ArborX::TreeConstruction::update_hierarchy [for]
|-> 2.68e-03 sec 0.4% 100.0% 0.0% ------ 1 ArborX::TreeConstruction::find_parents [for]
497M: speedup 1.88 (5% of total DBSCAN time)
4.98e-01 sec 1.9% 68.2% 0.0% 0.5% 1.40e+01 1 ArborX::DBSCAN::tree_construction [region]
|-> 4.96e-01 sec 1.9% 68.6% 0.0% 0.0% 1.41e+01 1 ArborX::BVH::BVH [region]
|-> 2.16e-01 sec 0.8% 101.0% 0.0% 0.1% 1.85e+01 1 ArborX::BVH::BVH::generate_hierarchy [region]
|-> 1.59e-01 sec 0.6% 1.6% 0.0% 0.0% 6.28e+00 1 ArborX::BVH::BVH::sort_linearized_order [region]
|-> 1.10e-01 sec 0.4% 100.0% 0.0% 0.0% 9.13e+00 1 ArborX::BVH::BVH::calculate_scene_bounding_box [region]
2.64e-01 sec 1.0% 100.7% 0.0% 0.0% 1.90e+01 1 ArborX::DBSCAN::tree_update [region]
|-> 2.64e-01 sec 1.0% 100.7% 0.0% 0.2% 1.90e+01 1 ArborX::BVH::update [region]
|-> 2.25e-01 sec 0.9% 100.0% 0.0% ------ 1 ArborX::TreeConstruction::update_hierarchy [for]
|-> 3.60e-02 sec 0.1% 100.0% 0.0% ------ 1 ArborX::TreeConstruction::find_parents [for]
I've also seen around 2x in Serial.
So, this is about as expected. We skip both sort and computing the scene bounding box. The latter is actually pretty expensive. The overall app improvement seems pretty insignificant in this case, and we may pay for it in the search.
Thanks for the quick answer!
I'm currently experimenting with performance-portable libraries and focusing on single-node runs on more modest hardware. The problem size is (relatively) small compared to your tests, around 1M-2M points.
I might have missed the forest for the trees here 😄. My prior bias of "construction slow, reuse good" was not necessarily applicable. Now that I understand the profiling results of my current test better, construction is around 20% (14% for construction) of the traversal time and very little of the overall computation step.
It seems that in my results the GPU is idle for 50% of the construction phase, though. Could allowing the user to provide the storage be beneficial, to avoid allocation/deallocation? In general, more fine-grained control over the operations would be nice, even if it is quite low level and only saves a few percent. (As a user of some other geometry libraries, I have become accustomed to a more granular constructor/build/update/query approach.)
Anyway, my issue currently seems to fall towards micro-optimization; I would need to provide stronger evidence of a bottleneck first. A hypothetical case for this might be the scenario of kinematically moved points/meshes, where the position update is trivial but occurs step-wise. I guess the construction/query ratio per step would become more relevant then?
I might have missed the forest for the trees here 😄. My prior bias of "construction slow, reuse good" was not necessarily applicable. Now that I understand the profiling results of my current test better, construction is around 20% (14% for construction) of the traversal time and very little of the overall computation step.
Yeah, this generally falls within the range we've observed in applications.
It seems that in my results the GPU is idle for 50% of the construction phase, though. Could allowing the user to provide the storage be beneficial, to avoid allocation/deallocation? In general, more fine-grained control over the operations would be nice, even if it is quite low level and only saves a few percent. (As a user of some other geometry libraries, I have become accustomed to a more granular constructor/build/update/query approach.)
How do you measure GPU idleness? I'm not exactly sure about the storage. Specifically, given that multiple arrays get allocated/deallocated during the construction, any improvement from providing storage would be in the noise (my personal expectation is less than 1%). What exactly do you have in mind? What libraries are you talking about? I could see it making a bit more sense for tiny data, when the construction time is on the order of the allocation (e.g., less than 100 in size), but that's not our main objective.
Anyway, my issue currently seems to fall towards micro-optimization; I would need to provide stronger evidence of a bottleneck first. A hypothetical case for this might be the scenario of kinematically moved points/meshes, where the position update is trivial but occurs step-wise. I guess the construction/query ratio per step would become more relevant then?
What does "trivial" mean here? Does it mean that the points move a tiny fraction compared to inter-point spacing? I'm not sure how it affects the ratio of construction/query, as I assume the query dominates the thing. It would be interesting to try to reuse the previous time step results to speed up a query. I think I've seen some papers trying to do that, but not sure how helpful that would be. I think that would bring more benefit than improving the construction time.
Let's reframe the discussion a bit, because it got me confused. I think there are two related topics at hand here. The first one was more about splitting the constructor from the building/initialization step, so that the object can reuse whatever memory is already allocated. The second topic is about optimizing the building phase itself and which pieces of work can be skipped assuming small variations of the primitive properties (centroids, radius, bounds, etc.). The issue was referring more to the former topic than the latter, even if my poor phrasing might have indicated the opposite.
For example, Cabana VerletList splits the data structure into a constructor VerletList() and build() while ArborX BoundingVolumeHierarchy seems to do all the work in the constructor (besides the default constructor).
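Roughly, what I have in mind is something along these lines (a purely hypothetical interface with made-up names, not something that exists in ArborX or Cabana):

```cpp
// Hypothetical sketch of a constructor/build/update/query split;
// none of these names exist in ArborX.
#include <Kokkos_Core.hpp>

struct Node // placeholder node type
{
  int left = -1, right = -1;
};

template <class MemorySpace>
class ReusableBVH
{
public:
  ReusableBVH() = default; // allocates nothing

  // Full (re)construction; reuses _nodes/_permute if the size is unchanged.
  template <class ExecutionSpace, class Primitives>
  void build(ExecutionSpace const &space, Primitives const &primitives);

  // Keeps the topology and only refits the bounding volumes.
  template <class ExecutionSpace, class Primitives>
  void update(ExecutionSpace const &space, Primitives const &primitives);

  template <class ExecutionSpace, class Predicates, class Callback>
  void query(ExecutionSpace const &space, Predicates const &predicates,
             Callback const &callback) const;

private:
  Kokkos::View<Node *, MemorySpace> _nodes;  // internal + leaf nodes
  Kokkos::View<int *, MemorySpace> _permute; // primitive permutation
};
```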
About the current message thread:
How do you measure GPU idleness?
The construction phase looks like this in Nsight (on an NVIDIA A10G); it seems there are quite a few memory transactions but little kernel time. ("idle" was likely an improper description.)
But that's probably because I have a relatively small amount of points (0.3~1.2M) compared to problems typically targeted by ArborX?
Generally, it would be most useful when the new primitives moved very slightly from their previous positions (if centroids of the primitives have not changed at all, it would be ideal, as the hierarchy structure would be the same).
That's commonly the case for simulations with explicit timestepping: the centroids would have moved only slightly compared to the neighboring inter-point distance. The whole tree could be rebuilt every N steps instead. I am not sure how ArborX proceeds; could there be a risk of false positives/negatives if no user-given margin is present?
In most applications we've seen, the hierarchy construction takes a small fraction of the query. I would be interested to know if that's not the case for you.
Indeed, how small that fraction is depends on the case. In my (small) problem the whole BVH constructor was still ~20% of the query time.
So, this is about as expected. We skip both sort and computing the scene bounding box. The latter is actually pretty expensive. The overall app improvement seems pretty insignificant in this case, and we may pay for it in the search.
I'm not sure how it affects the ratio of construction/query, as I assume the query dominates the thing. It would be interesting to try to reuse the previous time step results to speed up a query. I think I've seen some papers trying to do that, but not sure how helpful that would be. I think that would bring more benefit than improving the construction time.
Anything to speed up the query (which does dominate, yes) is welcome, but that seems like a tougher topic. I am not sure what the conclusion is on your side; is it too negligible to consider?
Let's reframe the discussion a bit, because it got me confused. I think there are two related topics at hand here. The first one was more about splitting the constructor from the building/initialization step, so that the object can reuse whatever memory is already allocated. The second topic is about optimizing the building phase itself and which pieces of work can be skipped assuming small variations of the primitive properties (centroids, radius, bounds, etc.). The issue was referring more to the former topic than the latter, even if my poor phrasing might have indicated the opposite.
OK, thank you for the clarification. Regarding the first one, BVH stores only two arrays, the leaf nodes and the internal nodes. So, you are talking about reusing those two arrays. It's possible, but I'd like to see data indicating that it's worth it.
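If the data did show a benefit, the reuse itself would look roughly like this (a sketch with made-up names, not ArborX code): keep the two views around and only reallocate when the number of primitives changes.

```cpp
// Sketch of reusing the two node arrays between rebuilds (hypothetical names).
#include <Kokkos_Core.hpp>

struct LeafNode { /* bounding volume + primitive index */ };
struct InternalNode { /* bounding volume + child indices */ };

template <class MemorySpace>
struct NodeStorage
{
  Kokkos::View<LeafNode *, MemorySpace> leaf_nodes;
  Kokkos::View<InternalNode *, MemorySpace> internal_nodes;

  void reserve(int n) // n = number of primitives
  {
    // Kokkos::realloc always frees and reallocates, so guard on the extent
    // to keep the existing allocations when the size has not changed.
    if (static_cast<int>(leaf_nodes.extent(0)) != n)
    {
      Kokkos::realloc(leaf_nodes, n);
      Kokkos::realloc(internal_nodes, n > 0 ? n - 1 : 0);
    }
  }
};
```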
For example, Cabana VerletList splits the data structure into a constructor VerletList() and build() while ArborX BoundingVolumeHierarchy seems to do all the work in the constructor (besides the default constructor).
I took a quick look at Cabana. I'm not entirely sure what it's reusing.
The construction phase looks like this in Nsight (on an NVIDIA A10G); it seems there are quite a few memory transactions but little kernel time. ("idle" was likely an improper description.)
But that's probably because I have a relatively small amount of points (0.3~1.2M) compared to problems typically targeted by ArborX?
Can you share the trace? Alternatively, can you see what percentage of the overall construction the allocation and deallocation of the leaf and internal nodes account for?
That's commonly the case for simulations with explicit timestepping: the centroids would have moved only slightly compared to the neighboring inter-point distance. The whole tree could be rebuilt every N steps instead. I am not sure how ArborX proceeds; could there be a risk of false positives/negatives if no user-given margin is present?
If we do it the way I wrote in that branch, the results are always correct, as the bounding boxes are updated. If the bounding boxes are not updated, then yeah, some results may be missed.
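As an aside, the usual trick behind the "user-given margin" you mention is to inflate the leaf bounding boxes by the maximum distance a primitive can travel before the next rebuild. A sketch (reusing the hypothetical Box type from the earlier snippet; not something ArborX currently exposes):

```cpp
// Pad a box by a user-given margin. With margin >= max_velocity * N * dt,
// queries against a tree rebuilt only every N steps cannot miss results;
// they only produce extra candidates that an exact distance check filters out.
KOKKOS_INLINE_FUNCTION
Box inflate(Box box, float margin)
{
  for (int d = 0; d < 3; ++d)
  {
    box.min[d] -= margin;
    box.max[d] += margin;
  }
  return box;
}
```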
Indeed, how small that fraction is depends on the case. In my (small) problem the whole BVH constructor was still ~20% of the query time.
How much of that is the leaf and internal nodes allocation/deallocation? In my experience, they take a negligible part of the hierarchy construction, and never show up in the timers.
But that's probably because I have a relatively small amount of points (0.3~1.2M) compared to problems typically targeted by ArborX?