cuCollections
Priority Queue
Adds a GPU-accelerated priority queue that allows multiple concurrent insertions as well as multiple concurrent deletions.
The implementation of the priority queue is based on https://arxiv.org/pdf/1906.06504.pdf.
The queue supports two operations:
- `push`: add elements into the queue
- `pop`: remove the element(s) with the lowest (when `Max == false`) or highest (when `Max == true`) keys
The priority queue supports bulk host-side operations and more fine-grained device-side operations.
The host-side bulk operations push and pop allow an arbitrary number of
elements to be pushed to or popped from the queue.
The device-side operations allow a cooperative group to push or pop up to `node_size` elements. These device-side operations are invoked through a trivially copyable device view, `device_mutable_view`, which can be obtained with the host function `get_mutable_device_view` and passed to the device.
Current limitations:
- Only supports trivially comparable key types
- Does not support insertion and deletion at the same time
- Capacity is fixed and the queue does not automatically resize
- Deletion from the queue is much slower than insertion into the queue due to congestion at the underlying heap's root node
TODO: Port tests to Catch2 and benchmarks to Google Benchmark
I reviewed the top level header at this point and gave some thoughts/questions on how to make this a little more generic.
Hi Jake,
Thank you for your comments! I've done the following to address them:
- I've changed the queue to accept arbitrary element types instead of just pairs, along with support for a custom comparison operator.
- I've added the option to specify a custom allocator.
- I've changed the push and pop functions to accept iterators instead of pointers.
- I've hidden kernel launch details from the user interface.
- I've hard-coded block sizes in `priority_queue.inl`, similarly to what is done in `static_map`.
- I've hidden the internal node size under a boolean template parameter `FavorInsertionPerformance`, allowing the user to sacrifice deletion performance for insertion performance (by decreasing the node size).
- I've removed warp-level insertion and deletion, as it didn't offer any performance improvement in my benchmarking. Furthermore, the user could achieve this if necessary by writing a kernel and using the device-side API.
Additionally, I have ported my tests to Catch2 and my benchmarks to gbenchmark.
@andrewbriand left a few more minor comments on the top-level header. Great work! This is very close to done.
@jrhemstad Thank you very much for your comments! I've responded to them above, let me know what you think.
@andrewbriand Can you resolve the conflict in CMake?
@PointKernel Done!
Hey @andrewbriand thanks again for the PR! Apologies for letting this languish for so long!
@PointKernel has kindly volunteered to take on shepherding this to getting it ready for merging.
@PointKernel Thanks for your comments! I believe that I have addressed or responded to them all. Please let me know what you think and what other comments you might have.
@andrewbriand Can you please also merge with the latest dev branch and fix build warnings (if there are any)?