unified-runtime icon indicating copy to clipboard operation
unified-runtime copied to clipboard

[OpenCL][USM] urEnqueueUSMFill incorrectly assumes destination memory alignment and fails

Open rafbiels opened this issue 1 year ago • 7 comments

TLDR: OpenCL adapter implementation of urEnqueueUSMFill calls clEnqueueMemFillINTEL for power-of-2 pattern size without checking destination memory alignment required by clEnqueueMemFillINTEL.

Full Story: I ran into this issue when trying to add sycl::queue::fill test in https://github.com/intel/llvm/pull/15991 (specific CI run failure: https://github.com/intel/llvm/actions/runs/11971631535/job/33695502538?pr=15991)

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Enqueue process failed.
Aborted (core dumped)

I will disable the OpenCL CPU backend in the e2e test, linking this issue in a comment, so it can be re-enabled when the problem is solved.

The minimal reproducer for the issue is:

#include <sycl/sycl.hpp>
#include <array>

constexpr size_t PatternSize{32}; // bytes
constexpr size_t NumElements{10};

int main() {
  sycl::queue q{};
  using T = std::array<uint8_t, PatternSize>;
  T value{};

  T *dptr{sycl::malloc_device<T>(NumElements, q)};
  q.fill(dptr, value, NumElements).wait();
  sycl::free(dptr, q);

  return 0;
}

compiled and ran with:

clang++ -fsycl -fsycl-targets=spir64_x86_64 -o mini minimal.cpp
ONEAPI_DEVICE_SELECTOR=opencl:cpu ./mini

Debugging further with this UR change:

diff --git a/source/adapters/opencl/usm.cpp b/source/adapters/opencl/usm.cpp
index dfcc1dfa..ed18659e 100644
--- a/source/adapters/opencl/usm.cpp
+++ b/source/adapters/opencl/usm.cpp
@@ -276,12 +276,15 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueUSMFill(
         cl_ext::getExtFuncFromContext<clEnqueueMemFillINTEL_fn>(
             CLContext, cl_ext::ExtFuncPtrCache->clEnqueueMemFillINTELCache,
             cl_ext::EnqueueMemFillName, &EnqueueMemFill));
-
-    CL_RETURN_ON_FAILURE(
+    CLErr =
         EnqueueMemFill(cl_adapter::cast<cl_command_queue>(hQueue), ptr,
                        pPattern, patternSize, size, numEventsInWaitList,
                        cl_adapter::cast<const cl_event *>(phEventWaitList),
-                       cl_adapter::cast<cl_event *>(phEvent)));
+                       cl_adapter::cast<cl_event *>(phEvent));
+    std::cout << "EnqueueMemFillINTEL(patternSize=" << patternSize << ") return code " << CLErr << std::endl;
+    if (CLErr != CL_SUCCESS) {
+      return mapCLErrorToUR(CLErr);
+    }
     return UR_RESULT_SUCCESS;
   }

confirmed that it's the clEnqueueMemFillINTEL call which sometimes returns -30 (CL_INVALID_VALUE).

EnqueueMemFillINTEL(patternSize=32) return code -30
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Enqueue process failed.
Aborted (core dumped)

I noticed that in my build of UR / DPC++ this happens when the binary name is shorter than 24 characters, but stops happening when it is longer. Simply renaming the file changes the behaviour. In another build I got the opposite behaviour where short-named binary succeeds but long-named fails. I assume what happens is that long file name causes heap allocation for argv[0] and shifts the memory layout, and thus alignment of the device allocation.

$ # short name fails
$ ./mini
EnqueueMemFillINTEL(patternSize=32) return code -30
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Enqueue process failed.
Aborted (core dumped)

$ # rename the file
$ cp mini mini12341234123412341234

$ # long name succeeds
$ ./mini12341234123412341234
EnqueueMemFillINTEL(patternSize=32) return code 0

I note that clEnqueueMemFillINTEL as described here: https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_unified_shared_memory.html may return:

CL_INVALID_VALUE if dst_ptr is NULL, or if dst_ptr is not aligned to pattern_size bytes

so I assume this is what happens as I checked the other conditions for returning CL_INVALID_VALUE are not met.

IIUC neither SYCL API nor UR API make any requirements about the destination memory alignment for their USM fill functions, therefore it is incorrect for the implementation to assume alignment. I think the solution here could be to check the alignment and take the other (slower) path which doesn't call clEnqueueMemFillINTEL when the alignment requirement is not met.

Side note: something seems to be lost in error handling here as the user is informed neither about the error code (INVALID_VALUE) nor its origin (USM fill). There is only a generic exception thrown by the SYCL runtime.

rafbiels avatar Dec 09 '24 14:12 rafbiels

I'm reverting the test addition in https://github.com/intel/llvm/pull/16465, please ensure it's properly re-committed before closing this issue.

aelovikov-intel avatar Dec 23 '24 23:12 aelovikov-intel

The test was added back with https://github.com/intel/llvm/pull/16544 now properly marked as UNSUPPORTED: (opencl && cpu). The issue is still there and the test can be used to validate future fixes (in addition to the reproducer above).

rafbiels avatar Jan 08 '25 15:01 rafbiels

It's been shown to fail also on the opencl:fpga target and on a second thought there is no reason at UR level why it wouldn't affect GPU as well, even if it hasn't been observed. The UR code is the same for any device type, so I believe all can be affected depending on how the driver allocates memory. https://github.com/intel/llvm/pull/16588 marked the test UNSUPPORTED for all OpenCL devices.

rafbiels avatar Jan 14 '25 10:01 rafbiels

This is not fixed, the reproducer still shows the issue with intel/llvm nightly-2025-05-07 and the following OpenCL CPU device:

[opencl:cpu][opencl:0] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i9-12900K OpenCL 3.0 (Build 0) [2025.19.4.0.18_160000.xmain-hotfix]
$ clang++ --version
clang version 21.0.0git (https://github.com/intel/llvm 1e5d85da1c4e5a7bc62948a99f13885b565b0c14)
$ clang++ -fsycl -fsycl-targets=spir64_x86_64 -o mini minimal.cpp
$ ONEAPI_DEVICE_SELECTOR=opencl:cpu ./mini
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Enqueue process failed.
Aborted (core dumped)

It looks like the CI configuration has changed and the affected test is no longer being run on the affected device. It was originally failing in Post Commit CI on OpenCL CPU and FPGA devices, but the Post Commit looks to run on GPU only now: https://github.com/intel/llvm/actions/runs/14880949602

rafbiels avatar May 07 '25 11:05 rafbiels

@rafbiels I ran the actual failing e2e test and it passed on my machine with the same version of OpenCL as in your comment, is that something you can check locally too? I didn't realise the CI configuration has changed to not run on CPU I will investigate this. The reproducer you gave does look like it's failing though, it seems like there is a difference between that and the e2e test?

martygrant avatar May 07 '25 14:05 martygrant

Hi @martygrant, as the issue lies in memory alignment, any small change to host allocation prior to the to-be-filled memory may hide or bring up the crash. I found that just renaming the binary has this effect as the size of argv[0] influences how the subsequent allocations are aligned. I assume version differences of glibc, libstdc++ and other libraries may also have this influence, changing the behaviour from one system to another. You could try renaming the test binary locally to see if that triggers the issue for you (or just work with the reproducer).

rafbiels avatar May 07 '25 14:05 rafbiels

I have reverted my PR to re-enable the fill_any_size.cpp test (https://github.com/intel/llvm/pull/18381) as it's passing was just being flaky and a proper fix needs to be investigated.

martygrant avatar May 09 '25 13:05 martygrant

I took a look at the reproducer and it looks like the call to malloc_device:

T *dptr{sycl::malloc_device<T>(NumElements, q)};

is generating this OpenCL call to allocate the memory:

>>>> clDeviceMemAllocINTEL: context = 0x1185dab8, device = 12th Gen Intel(R) Core(TM) i9-12900K (CL_DEVICE_TYPE_CPU) (0x103d1498), properties = (nil), size = 320, alignment = 1
<<<< clDeviceMemAllocINTEL: returned 0x102fe810 -> CL_SUCCESS

Since the alignment parameter is set to 1, there is no guaranteed alignment, and the test may fail sporadically.

Passing 1 as the alignment value seems odd here, and it's not coming from the UR. I think it's coming from here?

https://github.com/intel/llvm/blob/ccb079c62ce3cd683bda9a62808cf24859c14908/sycl/include/sycl/usm.hpp#L174

Note, alignof(T) is equal to 1, since it is an array of uint8_t.

I'm not sure if this test needs to use aligned_alloc_device to allocate with a bigger alignment, or if something needs to change with the memory fill, but at least this explains the current behavior.

bashbaug avatar May 20 '25 15:05 bashbaug

Maybe we just need a check here to ensure that the pointer is properly aligned, in addition to the existing check that the pattern size is supported?

https://github.com/oneapi-src/unified-runtime/blob/76898f0c5168841b715002748ae599a9d56f1bb1/source/adapters/opencl/usm.cpp#L272

bashbaug avatar May 20 '25 17:05 bashbaug

Maybe we just need a check here to ensure that the pointer is properly aligned, in addition to the existing check that the pattern size is supported?

unified-runtime/source/adapters/opencl/usm.cpp

Indeed, this was my thinking as well, but I haven't tried it myself

rafbiels avatar May 20 '25 18:05 rafbiels

@bashbaug hi Ben I opened a PR for this last week to check for pointer alignment before using the Intel extension, does this look about right? https://github.com/intel/llvm/pull/18423

martygrant avatar May 21 '25 09:05 martygrant

@rafbiels I also opened a PR to make the error message resulting from this more descriptive https://github.com/intel/llvm/pull/18517

martygrant avatar May 21 '25 09:05 martygrant

Thank you @martygrant and apologies I missed your PRs. Both look good to me!

rafbiels avatar May 21 '25 12:05 rafbiels