math OpenCL - buffer creation fails with Integrated Graphics

Description

Attempting to run OpenCL code with Intel Integrated graphics gives the error:

[ RUN      ] ProbDistributionsBernoulliCdf.opencl_matches_cpu_small
exception thrown in signature  const std::tuple<std::vector<int, std::allocator<int> >, Eigen::Matrix<stan::math::var_value<double, void>, -1, 1, 0, -1, 1> >&]:
unknown file: Failure
C++ exception with description "initialize_buffer: clCreateBuffer CL_INVALID_HOST_PTR: Unknown error -37" thrown in the test body.

Which has also been seen in this forum post.

I believe this is due to the use of the CL_MEM_USE_HOST_PTR flag when creating matrix_cl objects:

        buffer_cl_
            = cl::Buffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                         sizeof(T) * size(), A);  // this is always synchronous

Sharing the memory pointer to the GPU appears to cause an issue with integrated graphics that have shared memory for the CPU and GPU, and the CL_MEM_COPY_HOST_PTR flag should be used for these devices instead (and fixes the issue for me locally).

Given that this incurs a copy-cost, we wouldn't want to use it for all cases, only for integrated graphics. I think a simple approach here would be to add a new make/local flag: INTEGRATED_OPENCL, which would request the use of this CL_MEM_COPY_HOST_PTR flag:

#ifdef INTEGRATED_OPENCL
        buffer_cl_
            = cl::Buffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                         sizeof(T) * size(), A);  // this is always synchronous
#else
        buffer_cl_
            = cl::Buffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                         sizeof(T) * size(), A);  // this is always synchronous
#endif

@SteveBronder @rok-cesnovar I'm completely new to the OpenCL space so make sure to let me know if there's a better or more performant option here!

Current Version:

v4.4.0

Jul 09 '22 18:07 andrjohns

Thanks for finding this! What if we pulled out the bitfield like

constexpr auto copy_or_share = CL_MEM_COPY_HOST_PTR * INTEGRATED_OPENCL | (CL_MEM_USE_HOST_PTR * !INTEGRATED_OPENCL)

Then I think putting that in the mem flags would work?

Jul 09 '22 18:07 SteveBronder

(I forget bitfield logic and am on phone atm but it's something like that)

Jul 09 '22 18:07 SteveBronder

Neat! I'll do some testing locally and open a PR

Jul 10 '22 18:07 andrjohns

Your bitfield magic seems to work like a charm. I'll finish out running all of the OpenCL tests locally to make sure that the integrated implementation doesn't have any hidden issues

Jul 11 '22 08:07 andrjohns

Is this issue supposed to be fixed?

I am experiencing a similar issue with Cmdstan 2.32.2.

Jul 03 '23 23:07 jhjourdan

Have you used the fix added in the PR? Adding INTEGRATED_OPENCL=true to make/local

Jul 04 '23 05:07 andrjohns

Oh I see... This works now!

So there is no way to automatically detect the Integrated Graphics and switch the flag in this case?

Jul 04 '23 09:07 jhjourdan

Oh I see... This works now!

So there is no way to automatically detect the Integrated Graphics and switch the flag in this case?

Unfortunately not, this would require a bit more of a dedicated build system (e.g., a configure script) then we currently use.

But I've opened a PR to add this makeflag to the doc: https://github.com/stan-dev/docs/pull/649

Jul 06 '23 09:07 andrjohns

I'm sorry, but I don't understand: why cannot we detect this at runtime?

Jul 06 '23 09:07 jhjourdan

It has to be detected at compile-time, not runtime - since it requires different code to be compiled

Jul 06 '23 13:07 andrjohns

Well, it's just a different flag to pass to cl::Buffer. This can be easily changed at runtime.

Basically, this would mean keeping the same code, but replacing INTEGRATED_OPENCL with some global variable which is initialized once and for all.

Jul 06 '23 13:07 jhjourdan

Note the constexpr type for the flags, this means that it has to be known at compile time

Jul 06 '23 13:07 andrjohns

I'm looking at the OpenCL documentation:

https://man.opencl.org/clCreateBuffer.html

It doesn't say that flags are constexpr. It's only copy_or_share, which, for some reason, has been set constexpr in this patch. There is no reason not to revert this.

Jul 06 '23 13:07 jhjourdan

Sorry, I linked to the C OpenCL library. The C++ library is the same: https://github.khronos.org/OpenCL-CLHPP/classcl_1_1_buffer.html#a66ee6c9c837b5f74f02bef8f90459b5c

Jul 06 '23 13:07 jhjourdan

(Also, another point: is there any reason to avoid the copy cost when we can? It seems like not copying will be costly as well, because the GPU will need to access these data in general RAM.)

Jul 06 '23 14:07 jhjourdan