compute-runtime

clEnqueueReadBuffer fails on iGPUs for mapped host-buffer destinations

Open FreddieWitherden opened this issue 1 month ago • 4 comments

Consider the following snippet which I believe to be a valid use of the OpenCL API:

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

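/* Evaluate the expression once and abort with a message on failure. */
#define CHECK(err, msg) \
do { \
    cl_int check_err_ = (err); \
    if (check_err_ != CL_SUCCESS) { \
        fprintf(stderr, "%s failed (%d)\n", msg, check_err_); \
        exit(1); \
    } \
} while (0)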

int main(void) {
    cl_int err;

    cl_platform_id platform;
    CHECK(clGetPlatformIDs(1, &platform, NULL), "clGetPlatformIDs");

    cl_device_id device;
    CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL),
          "clGetDeviceIDs");

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    CHECK(err, "clCreateContext");

    cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, 0, &err);
    CHECK(err, "clCreateCommandQueueWithProperties");

    const size_t N = 16;
    const size_t bytes = N * sizeof(float);
    cl_mem dev_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &err);
    CHECK(err, "clCreateBuffer dev_buf");

    float pattern = 42.0f; // fill value
    CHECK(clEnqueueFillBuffer(queue, dev_buf, &pattern, sizeof(float),
                              0, bytes, 0, NULL, NULL),
          "clEnqueueFillBuffer");

    cl_mem host_buf = clCreateBuffer(context,
                                     CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                     bytes, NULL, &err);
    CHECK(err, "clCreateBuffer host_buf");

    void *host_ptr = clEnqueueMapBuffer(queue, host_buf, CL_TRUE,
                                        CL_MAP_WRITE, 0, bytes,
                                        0, NULL, NULL, &err);
    CHECK(err, "clEnqueueMapBuffer");

    err = clEnqueueReadBuffer(queue, dev_buf, CL_FALSE, 0, bytes, host_ptr,
                              0, NULL, NULL);
    CHECK(err, "clEnqueueReadBuffer (non-blocking)");

    CHECK(clFinish(queue), "clFinish");

    CHECK(clEnqueueUnmapMemObject(queue, host_buf, host_ptr, 0, NULL, NULL),
          "clEnqueueUnmapMemObject");
    clFinish(queue);

    clReleaseMemObject(dev_buf);
    clReleaseMemObject(host_buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    return 0;
}

Running this on an NVIDIA GPU or an Intel A770m works as expected. However, on my iGPU (TigerLake-H GT1) it fails with:

clEnqueueReadBuffer (non-blocking) failed (-5)

Changing the destination to an ordinary buffer (from malloc) appears to work, as does making the read blocking. My runtime version is 25.40.35563.4 and I am on a 6.17.5 kernel with the i915 module.
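
For readers decoding the failure: in `CL/cl.h` the value -5 is `CL_OUT_OF_RESOURCES`. The helper below is a minimal sketch for printing symbolic names instead of raw integers. The name `cl_error_name` is hypothetical and not part of OpenCL or of the reproducer, and only a handful of codes are covered.

```c
/* Map a few OpenCL error codes to their symbolic names.
 * The numeric values match the definitions in CL/cl.h; the function
 * name is illustrative only and the table is deliberately incomplete. */
const char *cl_error_name(int code) {
    switch (code) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -3:  return "CL_COMPILER_NOT_AVAILABLE";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -30: return "CL_INVALID_VALUE";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -38: return "CL_INVALID_MEM_OBJECT";
    default:  return "unknown OpenCL error";
    }
}
```

Such a helper could be called from the `CHECK` macro so that the message names the error as well as printing its number.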

FreddieWitherden · Nov 10 '25 18:11

Hi @FreddieWitherden,

Thanks for the report. We tried to reproduce the issue, but on our side the test behaves correctly. Our setup was:

  • driver - 25.40.35563.4 - https://github.com/intel/compute-runtime/releases/tag/25.40.35563.4
  • OS - Linux Ubuntu 24.04.3 LTS
  • kernel - 6.17.5-061705-generic
  • logs:
00:02.0 VGA compatible controller: Intel Corporation TigerLake-H GT1 [UHD Graphics] (prog-if 00 [VGA controller])
...
>>>> [279965083374] clGetDeviceIDs: platform = 0x1d7184c0 deviceType = 1 numEntries = 1 devices = 0x7ffecfc901a8 numDevices = 0
<<<< [279965096232] clGetDeviceIDs [2101 ns] -> CL_SUCCESS (0)
>>>> [279965105757] clCreateContext: properties = 0 numDevices = 1 devices = 0x7ffecfc901a8 funcNotify = 0 userData = 0 errcodeRet = 0x7ffecfc90194
<<<< [279966181417] clCreateContext [1065808 ns] result = 0x1edff738 -> CL_SUCCESS (0)
>>>> [279966201226] clCreateCommandQueueWithProperties: context = 0x1edff738 device = 0x1d84cc08 properties = 0 errcodeRet = 0x7ffecfc90194
<<<< [279966464678] clCreateCommandQueueWithProperties [256000 ns] result = 0x1ed74318 -> CL_SUCCESS (0)
>>>> [279966475769] clCreateBuffer: context = 0x1edff738 flags = 1 size = 64 hostPtr = 0 errcodeRet = 0x7ffecfc90194
<<<< [279966502979] clCreateBuffer [18401 ns] result = 0x1ee281c8 -> CL_SUCCESS (0)
>>>> [279966983258] clEnqueueFillBuffer: commandQueue = 0x1ed74318 buffer = 0x1ee281c8 pattern = 0x7ffecfc901a4 patternSize = 4 offset = 0 size = 64 numEventsInWaitList = 0 eventWaitList = 0 event = 0
<<<< [279967050146] clEnqueueFillBuffer [51764 ns] -> CL_SUCCESS (0)
>>>> [279967071998] clCreateBuffer: context = 0x1edff738 flags = 17 size = 64 hostPtr = 0 errcodeRet = 0x7ffecfc90194
<<<< [279967087438] clCreateBuffer [6962 ns] result = 0x1ee2ae78 -> CL_SUCCESS (0)
>>>> [279967097351] clEnqueueMapBuffer: commandQueue = 0x1ed74318 buffer = 0x1ee2ae78 blockingMap = 1 mapFlags = 2 offset = 0 cb = 64 numEventsInWaitList = 0 eventWaitList = 0 event = 0 errcodeRet = 0x7ffecfc90194
<<<< [279967149995] clEnqueueMapBuffer [42873 ns] result = 0x1ee2b380 -> CL_SUCCESS (0)
>>>> [279967404047] clEnqueueReadBuffer: commandQueue = 0x1ed74318 buffer = 0x1ee281c8 blockingRead = 0 offset = 0 cb = 64 ptr = 0x1ee2b380 numEventsInWaitList = 0 eventWaitList = 0 event = 0
<<<< [279967511127] clEnqueueReadBuffer [12626 ns] -> CL_SUCCESS (0)
>>>> [279967748874] clFinish: commandQueue = 0x1ed74318
<<<< [279968367600] clFinish [351140 ns] -> CL_SUCCESS (0)
>>>> [279968590757] clEnqueueUnmapMemObject: commandQueue = 0x1ed74318 memobj = 0x1ee2ae78 mappedPtr = 0x1ee2b380 numEventsInWaitList = 0 eventWaitList = 0 event = 0
<<<< [279968618506] clEnqueueUnmapMemObject [19520 ns] -> CL_SUCCESS (0)
>>>> [279968627101] clFinish: commandQueue = 0x1ed74318
<<<< [279968669136] clFinish [36575 ns] -> CL_SUCCESS (0)
>>>> [279968693120] clReleaseMemObject: memobj = 0x1ee281c8
<<<< [279968714884] clReleaseMemObject [7926 ns] -> CL_SUCCESS (0)
>>>> [279968730601] clReleaseMemObject: memobj = 0x1ee2ae78
<<<< [279968737977] clReleaseMemObject [2445 ns] -> CL_SUCCESS (0)
>>>> [279968745833] clReleaseCommandQueue: commandQueue = 0x1ed74318
<<<< [279968759961] clReleaseCommandQueue [8298 ns] -> CL_SUCCESS (0)
>>>> [279968768661] clReleaseContext: context = 0x1edff738
<<<< [279968796836] clReleaseContext [22550 ns] -> CL_SUCCESS (0) 

Could you provide more information from your side, such as the specific API call and any relevant dmesg logs? This will help us investigate further.
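
The numeric arguments in the trace above are OpenCL bitfields: `flags = 17` on the second `clCreateBuffer` is `CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR`, `mapFlags = 2` is `CL_MAP_WRITE`, and `deviceType = 1` is `CL_DEVICE_TYPE_DEFAULT`, which matches the reproducer. A minimal sketch of a decoder follows. The helper names (`decode_flags`, `decode_mem_flags`, `decode_map_flags`) are hypothetical, and the bit values are copied from `CL/cl.h` so that the snippet builds without the OpenCL headers.

```c
#include <stdio.h>

/* One named bit of an OpenCL bitfield. */
typedef struct {
    unsigned long long bit;
    const char *name;
} flag_name;

/* cl_mem_flags bits (values as in CL/cl.h). */
static const flag_name MEM_FLAGS[] = {
    {1ull << 0, "CL_MEM_READ_WRITE"},
    {1ull << 1, "CL_MEM_WRITE_ONLY"},
    {1ull << 2, "CL_MEM_READ_ONLY"},
    {1ull << 3, "CL_MEM_USE_HOST_PTR"},
    {1ull << 4, "CL_MEM_ALLOC_HOST_PTR"},
    {1ull << 5, "CL_MEM_COPY_HOST_PTR"},
};

/* cl_map_flags bits (values as in CL/cl.h). */
static const flag_name MAP_FLAGS[] = {
    {1ull << 0, "CL_MAP_READ"},
    {1ull << 1, "CL_MAP_WRITE"},
    {1ull << 2, "CL_MAP_WRITE_INVALIDATE_REGION"},
};

/* Write the names of all set bits into out, joined by " | ".
 * Stops early if the output buffer is too small. Returns out. */
char *decode_flags(unsigned long long value, const flag_name *table,
                   size_t count, char *out, size_t out_size) {
    size_t used = 0;
    if (out_size == 0) return out;
    out[0] = '\0';
    for (size_t i = 0; i < count; ++i) {
        if (!(value & table[i].bit)) continue;
        int n = snprintf(out + used, out_size - used, "%s%s",
                         used ? " | " : "", table[i].name);
        if (n < 0 || (size_t)n >= out_size - used) break; /* truncated */
        used += (size_t)n;
    }
    return out;
}

/* Convenience wrappers for the two bitfields seen in the trace. */
char *decode_mem_flags(unsigned long long value, char *out, size_t out_size) {
    return decode_flags(value, MEM_FLAGS,
                        sizeof MEM_FLAGS / sizeof MEM_FLAGS[0], out, out_size);
}

char *decode_map_flags(unsigned long long value, char *out, size_t out_size) {
    return decode_flags(value, MAP_FLAGS,
                        sizeof MAP_FLAGS / sizeof MAP_FLAGS[0], out, out_size);
}
```

For example, `decode_mem_flags(17, buf, sizeof buf)` with a 128-byte `buf` yields `CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR`.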

kgibala · Nov 17 '25 07:11

I've attached the output from strace which shows the syscalls. I have also been able to reproduce this on an Alder Lake system with its integrated GPU.

trace.txt

There is nothing in dmesg.

FreddieWitherden · Nov 17 '25 15:11

Hi @FreddieWitherden,

Thank you for providing the logs. Could you please share additional details about your system configuration? Specifically, are you using a single GPU or multiple GPUs? Additionally, please provide the exact driver version and kernel version you are using. This information will help us better understand your setup and assist you more effectively.

kgibala · Nov 19 '25 10:11

The runtime is from my package manager:

gentoo ~ # emerge -av intel-compute-runtime

These are the packages that would be merged, in order:

Calculating dependencies... done!
Dependency resolution took 1.86 s (backtrack: 0/20).

[ebuild   R    ] dev-libs/intel-compute-runtime-25.40.35563.4:0/1.6.35563::gentoo  USE="l0 vaapi -disable-mitigations" 0 KiB

Total: 1 package (1 reinstall), Size of downloads: 0 KiB

The kernel is

gentoo ~ # uname -a
Linux gentoo 6.17.5-gentoo-x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 24 09:00:21 CDT 2025 x86_64 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz GenuineIntel GNU/Linux

and I've attached the config (I configured/compiled it myself so let me know if you want me to enable any debugging options).

config.gz

FreddieWitherden · Nov 19 '25 18:11