
an extension that adds a device side abort function

Open pjaaskel opened this issue 3 years ago • 5 comments

With the device side abort extension, a work-item (WI) can return from the kernel execution at any point and cause abnormal unrecoverable termination of the host process.

This extension differs from the cl_arm_controlled_kernel_termination extension in that the device-side abort is expected to behave like a call to the POSIX abort() on the host side, also terminating the host process immediately.

The extension is meant to easily support POSIX abort()-like functionality on the device side, as well as serve as a basis for CUDA/HIP-style assertions.

An example CPU implementation in PoCL: https://github.com/parmance/pocl/commit/d5e88c700889be4b1296faf4a261810580829d1b

pjaaskel avatar Jun 21 '22 12:06 pjaaskel

Here is the accompanying SPIR-V extension: https://github.com/KhronosGroup/SPIRV-Registry/pull/149

Kerilk avatar Jul 12 '22 17:07 Kerilk

Discussed in the September 20th teleconference:

  • Use-case: similar to __trap() in CUDA.
  • There are several similar proposals and it would be nice to consolidate them into an EXT extension (this PR).
  • There is a related SPIR-V extension (linked above) but it has not been implemented (yet).
  • As written, the extension will terminate the entire process when a device kernel calls the abort function.
  • There could be use-cases that aren't as catastrophic. Do we want to support recovery, and if so, how recoverable can we be?
  • For example, how does calling abort affect other work-items or work-groups that may be executing? How does calling abort affect other commands in the command-queue that may be dependent on the aborted command?

bashbaug avatar Sep 22 '22 00:09 bashbaug

> There could be use-cases that aren't as catastrophic. Do we want to support recovery, and if so, how recoverable can we be?

As I suggested in the call, I think these are two different use cases, which could call for separate extensions so that supporting only one of them does not become too difficult. The primary use case for this simple one is to be part of an assert() implementation: to allow easier porting (even automated migration) of host functions that contain assert() or abort() calls to device-side code.
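To illustrate the assert() use case, here is a minimal sketch in plain host-side C, so that it compiles anywhere. The names `DEVICE_ASSERT` and `checked_div` are hypothetical and not part of the proposal; in a kernel, the `abort()` call would be whatever abort function the extension ends up providing, and the diagnostic would go through the device-side printf rather than `fprintf`.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical assert-style macro built on abort().
 * On failure it reports the location and condition, then terminates
 * the whole process abnormally, as the proposed device-side abort would. */
#define DEVICE_ASSERT(cond)                                      \
    do {                                                         \
        if (!(cond)) {                                           \
            fprintf(stderr, "%s:%d: assertion failed: %s\n",     \
                    __FILE__, __LINE__, #cond);                  \
            abort();                                             \
        }                                                        \
    } while (0)

/* Hypothetical host function of the kind one might want to port
 * to device code without rewriting its error handling. */
int checked_div(int num, int den) {
    DEVICE_ASSERT(den != 0);
    return num / den;
}
```

With this sketch, `checked_div(6, 3)` returns 2, while `checked_div(1, 0)` prints the failed condition and never returns.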

> For example, how does calling abort affect other work-items or work-groups that may be executing? How does calling abort affect other commands in the command-queue that may be dependent on the aborted command?

In this extension, the expected behavior for commands is the same as with any multithreaded program where one thread calls the standard abort(). Other parallel threads (commands) might have proceeded further or not, but the end result is either catching SIGABRT or brutally killing the process along with its threads (in this case also device/GPU threads).
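The host-side analogy can be sketched with POSIX threads. This is only an illustration of the multithreaded abort() behavior referred to above, not of any OpenCL implementation; the function names are hypothetical, and the fork() exists solely so that the caller survives to observe how the "process" died.

```c
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a work-item that hits a fatal condition. */
static void *aborting_worker(void *arg) {
    (void)arg;
    abort(); /* raises SIGABRT for the whole process, not just this thread */
    return NULL; /* not reached */
}

/* Stand-in for an unrelated work-item that is still executing. */
static void *busy_worker(void *arg) {
    (void)arg;
    for (;;)
        pause(); /* sleeps until a signal arrives */
    return NULL; /* not reached */
}

/* Runs both workers in a forked child and returns the number of the
 * signal that terminated the child, 0 if it exited normally, or -1 on
 * a fork/wait error. */
int signal_that_killed_child(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        pthread_t busy, aborting;
        if (pthread_create(&busy, NULL, busy_worker, NULL) != 0)
            _exit(1);
        if (pthread_create(&aborting, NULL, aborting_worker, NULL) != 0)
            _exit(1);
        pthread_join(aborting, NULL); /* the abort ends the process first */
        _exit(0);                     /* not reached */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Here `signal_that_killed_child()` returns `SIGABRT`: the busy thread never gets a chance to finish, which mirrors what happens to other in-flight commands under this extension.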

I tried to describe the WI semantics in the last paragraph: https://github.com/KhronosGroup/OpenCL-Docs/pull/808/files#diff-149e893d23663ca01188af6d03a0ebd77bae5776abefe7ec6b063e4bbd88212fR103

pjaaskel avatar Sep 22 '22 07:09 pjaaskel

Discussed in the December 17th teleconference. Some of this functionality appears to be implemented in Mesa:

https://www.phoronix.com/news/OpenCL-C-Std-Lib-Mesa-25.0
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32529

We should see what it would take to get this across the finish line.

bashbaug avatar Dec 17 '24 19:12 bashbaug

@karolherbst do you have feedback on this from the perspective of Mesa/Rusticl?

pjaaskel avatar Jul 24 '25 14:07 pjaaskel