
Add an experimental HSA backend

Open maawad opened this issue 6 months ago • 2 comments

This PR proposes adding an experimental Heterogeneous System Architecture (HSA) backend for IREE. HSA provides standard APIs for managing low-level device primitives such as queues, signals, and memory pools, and the proposed backend surfaces these primitives to the HAL. I marked the PR as a draft and would love your feedback. I will happily address any comments, discuss code changes in the review comments, or walk through them on Teams. Jose and I coauthored this work on AMD's RAD team.

Additional notes:

  • The backend implementation started as a copy of the HIP backend (commit hash 9e95c38fdf1274e17eef521edc8536b3f10f791b), which I reduced to the bare minimum needed to dispatch packets.
  • The backend implements a simple dispatch model built on a single queue.
  • HIP events are replaced with barrier packets whose completion signals invoke a user-defined function once the packet is reached.
  • The implementation only uses a fine-grained memory pool to service the allocations.
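
To make the dispatch model in the notes above concrete, here is a minimal sketch in plain Python. It is a toy model, not the HSA API or the code in this PR: `Signal`, `Packet`, and `SingleQueue` are hypothetical names, and the queue is drained synchronously on the host, whereas a real device consumes packets asynchronously. It only illustrates the ordering: dispatch packets run in submission order on one queue, and a trailing barrier packet's completion signal fires a user-defined callback.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Signal:
    """Toy completion signal: counts down and runs its handlers on reaching 0."""
    value: int = 1
    handlers: List[Callable[[], None]] = field(default_factory=list)

    def decrement(self) -> None:
        self.value -= 1
        if self.value == 0:
            for handler in self.handlers:
                handler()


@dataclass
class Packet:
    """Hypothetical packet: 'dispatch' runs work, 'barrier' only signals."""
    kind: str
    work: Optional[Callable[[], None]] = None
    completion: Optional[Signal] = None


class SingleQueue:
    """All packets share one in-order queue (assumption: strict FIFO)."""

    def __init__(self) -> None:
        self._packets = deque()

    def submit(self, packet: Packet) -> None:
        self._packets.append(packet)

    def drain(self) -> None:
        # Process packets in submission order; signal completion after each.
        while self._packets:
            packet = self._packets.popleft()
            if packet.kind == "dispatch" and packet.work is not None:
                packet.work()
            if packet.completion is not None:
                packet.completion.decrement()


log = []
queue = SingleQueue()
done = Signal(value=1, handlers=[lambda: log.append("callback")])

queue.submit(Packet("dispatch", work=lambda: log.append("kernel A")))
queue.submit(Packet("dispatch", work=lambda: log.append("kernel B")))
queue.submit(Packet("barrier", completion=done))  # stands in for a HIP event
queue.drain()

print(log)  # ['kernel A', 'kernel B', 'callback']
```

Because the barrier sits behind both dispatches in the same queue, the callback observes all earlier work as finished, which is the role HIP events played in the original backend.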

Apart from the issues listed below, the other 106 unit tests pass. I tested on gfx1103 with ROCm 6.2.0, and I can and will test on other chips as well.

Known issues (at the moment):

  • ROCr lacks some of the async memory-copy and fill APIs. Closing this gap will require either feature requests in ROCr or custom kernels in IREE.
  • Deferred execution is currently failing. I believe this is because the binding tables are not passed through correctly, which is a recent change that is not incorporated here.
  • Some of the module run tests are also failing (see the CMakeLists.txt files for the tests).
  • The semaphore tests WaitThenFail and MultiWaitThenFail are failing at the moment.

Some possible future improvements:

  • A graph-based command buffer could be implemented, with the execution graph assembled from queues, signals, packets, and barrier dispatches chained together.
  • Different memory pools (e.g., coarse- and fine-grained) can be used to service allocations for different coherence guarantees.
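
As a rough illustration of the memory-pool idea, the sketch below picks a pool based on the coherence an allocation needs. Everything here is hypothetical (`Pool`, `Granularity`, `select_pool`, and the pool names); it is not code from this PR or from ROCr. It assumes fine-grained memory gives the stronger host/device coherence guarantee, so falling back to it when no coarse-grained pool exists is always safe.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Granularity(Enum):
    FINE = "fine"      # assumed: coherent with the host while kernels run
    COARSE = "coarse"  # assumed: coherent only at synchronization points


@dataclass(frozen=True)
class Pool:
    """Hypothetical description of a device memory pool."""
    name: str
    granularity: Granularity


def select_pool(pools: Sequence[Pool], needs_host_coherence: bool) -> Pool:
    """Return the first pool matching the requested coherence guarantee."""
    wanted = Granularity.FINE if needs_host_coherence else Granularity.COARSE
    for pool in pools:
        if pool.granularity is wanted:
            return pool
    # No coarse-grained pool available: fine-grained is the stronger
    # guarantee, so it can also serve allocations that asked for less.
    for pool in pools:
        if pool.granularity is Granularity.FINE:
            return pool
    raise LookupError("no pool satisfies the requested coherence guarantee")


pools = [
    Pool("device_coarse", Granularity.COARSE),
    Pool("device_fine", Granularity.FINE),
]

host_visible = select_pool(pools, needs_host_coherence=True)
device_only = select_pool(pools, needs_host_coherence=False)
print(host_visible.name, device_only.name)  # device_fine device_coarse
```

The current implementation corresponds to the degenerate case where only the fine-grained pool is registered, so every request resolves to it.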

maawad, Aug 26 '24 01:08