cuCollections icon indicating copy to clipboard operation
cuCollections copied to clipboard

Priority Queue

Open andrewbriand opened this issue 4 years ago • 13 comments

Adds a GPU-accelerated priority queue

Allows for multiple concurrent insertions as well as multiple concurrent deletions.

The implementation of the priority queue is based on https://arxiv.org/pdf/1906.06504.pdf.

The queue supports two operations: push: Add elements into the queue pop: Remove the element(s) with the lowest (when Max == false) or highest (when Max == true) keys

The priority queue supports bulk host-side operations and more fine-grained device-side operations.

The host-side bulk operations push and pop allow an arbitrary number of elements to be pushed to or popped from the queue.

The device-side operations allow a cooperative group to push or pop some number of elements less than or equal to node_size. These device side operations are invoked with a trivially-copyable device view, device_mutable_view which can be obtained with the host function get_mutable_device_view and passed to the device.

Current limitations:

  • Only supports trivially comparable key types
  • Does not support insertion and deletion at the same time
  • Capacity is fixed and the queue does not automatically resize
  • Deletion from the queue is much slower than insertion into the queue due to congestion at the underlying heap's root node

TODO: Port tests to Catch2 and benchmarks to google benchmark

andrewbriand avatar Sep 10 '21 21:09 andrewbriand

Can one of the admins verify this patch?

GPUtester avatar Sep 10 '21 21:09 GPUtester

I reviewed the top level header at this point and gave some thoughts/questions on how to make this a little more generic.

jrhemstad avatar Sep 10 '21 21:09 jrhemstad

Hi Jake,

Thank you for your comments! I've done the following to address them:

  1. I've changed the queue to accept arbitrary types instead of just pairs, including a custom comparison operator.
  2. I've added the option to specify a custom allocator.
  3. I've changed the push and pop functions to accept iterators instead of pointers.
  4. I've hidden kernel launch details from the user interface.
    • I hard coded block sizes in priority_queue.inl similarly to what is done in static_map.
    • I hid the internal node size under a boolean template parameter FavorInsertionPerformance allowing the user to sacrifice deletion performance for insertion performance (by decreasing the node size).
    • I've removed warp level insertion and deletion as it didn't offer any performance improvement in my benchmarking. Furthermore, the user could achieve this if necessary by writing a kernel and using the Device-side API.

Additionally, I have ported my tests to Catch2 and my benchmarks to gbenchmark.

andrewbriand avatar Dec 28 '21 00:12 andrewbriand

@andrewbriand left a few more minor comments on the top-level header. Great work! This is very close to done.

jrhemstad avatar Jan 06 '22 16:01 jrhemstad

@jrhemstad Thank you very much for your comments! I've responded to them above, let me know what you think.

andrewbriand avatar Apr 13 '22 04:04 andrewbriand

@andrewbriand Can you resolve the conflict in CMake?

PointKernel avatar Apr 13 '22 20:04 PointKernel

@andrewbriand Can you resolve the conflict in CMake?

@PointKernel Done!

andrewbriand avatar Apr 14 '22 03:04 andrewbriand

Hey @andrewbriand thanks again for the PR! Apologies for letting this languish for so long!

@PointKernel has kindly volunteered to take on shepherding this to getting it ready for merging.

jrhemstad avatar May 18 '22 19:05 jrhemstad

ok to test

PointKernel avatar Jun 15 '22 17:06 PointKernel

@PointKernel Thanks for your comments! I believe that I have addressed or responded to them all. Please let me know what you think and what other comments you might have.

andrewbriand avatar Jun 19 '22 01:06 andrewbriand

@andrewbriand Can you please also merge with the latest dev branch and fix build warnings (if there is any)?

PointKernel avatar Jun 19 '22 20:06 PointKernel