cuCollections
Priority Queue
Adds a GPU-accelerated priority queue that allows multiple concurrent insertions as well as multiple concurrent deletions.
The implementation of the priority queue is based on https://arxiv.org/pdf/1906.06504.pdf.
The queue supports two operations:
- `push`: add elements into the queue
- `pop`: remove the element(s) with the lowest (when `Max == false`) or highest (when `Max == true`) keys
The priority queue supports bulk host-side operations and more fine-grained device-side operations.
The host-side bulk operations push and pop allow an arbitrary number of
elements to be pushed to or popped from the queue.
The device-side operations allow a cooperative group to push or pop up to `node_size` elements. These device-side operations are invoked through a trivially copyable device view, `device_mutable_view`, which can be obtained with the host function `get_mutable_device_view` and passed to the device.
Current limitations:
- Only supports trivially comparable key types
- Does not support insertion and deletion at the same time
- Capacity is fixed and the queue does not automatically resize
- Deletion from the queue is much slower than insertion into the queue due to congestion at the underlying heap's root node
TODO: Port tests to Catch2 and benchmarks to Google Benchmark
I reviewed the top level header at this point and gave some thoughts/questions on how to make this a little more generic.
Hi Jake,
Thank you for your comments! I've done the following to address them:
- I've changed the queue to accept arbitrary element types instead of just pairs, along with support for a custom comparison operator.
- I've added the option to specify a custom allocator.
- I've changed the push and pop functions to accept iterators instead of pointers.
- I've hidden kernel launch details from the user interface.
- I've hard-coded block sizes in `priority_queue.inl`, similarly to what is done in `static_map`.
- I've hidden the internal node size under a boolean template parameter `FavorInsertionPerformance`, allowing the user to sacrifice deletion performance for insertion performance (by decreasing the node size).
- I've removed warp-level insertion and deletion, as it didn't offer any performance improvement in my benchmarking. Furthermore, the user could achieve this if necessary by writing a kernel and using the device-side API.
Additionally, I have ported my tests to Catch2 and my benchmarks to gbenchmark.
@andrewbriand left a few more minor comments on the top-level header. Great work! This is very close to done.
@jrhemstad Thank you very much for your comments! I've responded to them above, let me know what you think.
@andrewbriand Can you resolve the conflict in CMake?
@PointKernel Done!
Hey @andrewbriand thanks again for the PR! Apologies for letting this languish for so long!
@PointKernel has kindly volunteered to take on shepherding this to getting it ready for merging.
@PointKernel Thanks for your comments! I believe that I have addressed or responded to them all. Please let me know what you think and what other comments you might have.
@andrewbriand Can you please also merge with the latest dev branch and fix build warnings (if there are any)?