[CUDA] Max local mem size check should return OUT_OF_RESOURCES

Open rafbiels opened this issue 1 year ago • 1 comments

Building on top of https://github.com/intel/llvm/pull/12604 and https://github.com/oneapi-src/unified-runtime/pull/1318, which add handleOutOfResources to dpcpp and return UR_RESULT_ERROR_OUT_OF_RESOURCES, the local mem size check at https://github.com/oneapi-src/unified-runtime/blob/f086f369cab557bf2a589e22bfc37e18d7de5fa8/source/adapters/cuda/enqueue.cpp#L294-L298 should also return UR_RESULT_ERROR_OUT_OF_RESOURCES, and a dedicated error handling case for it should be added to handleOutOfResources.
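To illustrate the proposal, here is a minimal, self-contained sketch. It is not the adapter's actual code: the `Result` enum, `checkLocalMemSize`, and `describeFailure` are hypothetical stand-ins for the real result codes, the check in enqueue.cpp, and the dedicated case in handleOutOfResources.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the runtime's result codes; this is NOT the
// real Unified Runtime enum.
enum class Result { Success, BackendSpecific, OutOfResources };

// Hypothetical version of the local mem size check: a request above the
// device limit is reported as OutOfResources rather than as a
// backend-specific error.
Result checkLocalMemSize(std::size_t requestedBytes,
                         std::size_t deviceMaxBytes) {
  if (requestedBytes > deviceMaxBytes) {
    return Result::OutOfResources;
  }
  return Result::Success;
}

// Hypothetical dedicated handling case: map the result code to a message
// that users can search for. In this simplified sketch OutOfResources only
// ever comes from the local mem size check.
std::string describeFailure(Result result) {
  switch (result) {
  case Result::OutOfResources:
    return "Excessive allocation of local memory on the device";
  case Result::BackendSpecific:
    return "The plugin has emitted a backend specific error";
  case Result::Success:
    return "";
  }
  return "";
}
```

With this shape, a call such as `checkLocalMemSize(64 * 1024, 48 * 1024)` yields `Result::OutOfResources`, which the handler can turn into the specific message without the generic wrapper. The sizes here are arbitrary example values, not real device limits.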

Right now, submitting a kernel with too large a local mem size results in:

Native API failed. Native API returns: -996 (The plugin has emitted a backend specific error)
Excessive allocation of local memory on the device
 -996 (The plugin has emitted a backend specific error)

This does contain a helpful exception message, but it is wrapped in generic and confusing "backend specific error" messages and the unhelpful code -996. Having this return ERROR_OUT_OF_RESOURCES would make the error easier for us to cover in the troubleshooting guide, and easier for users to find with web search engines.

Feb 08 '24 13:02 rafbiels

@GeorgeWeb I've assigned this to you since it's building on top of your PRs.

Feb 15 '24 13:02 kbenzie