OpenCL - buffer creation fails with Integrated Graphics
Description
Attempting to run OpenCL code with Intel Integrated graphics gives the error:
[ RUN ] ProbDistributionsBernoulliCdf.opencl_matches_cpu_small
exception thrown in signature const std::tuple<std::vector<int, std::allocator<int> >, Eigen::Matrix<stan::math::var_value<double, void>, -1, 1, 0, -1, 1> >&]:
unknown file: Failure
C++ exception with description "initialize_buffer: clCreateBuffer CL_INVALID_HOST_PTR: Unknown error -37" thrown in the test body.
Which has also been seen in this forum post.
I believe this is due to the use of the CL_MEM_USE_HOST_PTR flag when creating matrix_cl objects:
buffer_cl_
= cl::Buffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
sizeof(T) * size(), A); // this is always synchronous
Sharing the memory pointer to the GPU appears to cause an issue with integrated graphics that have shared memory for the CPU and GPU, and the CL_MEM_COPY_HOST_PTR flag should be used for these devices instead (and fixes the issue for me locally).
Given that this incurs a copy-cost, we wouldn't want to use it for all cases, only for integrated graphics. I think a simple approach here would be to add a new make/local flag: INTEGRATED_OPENCL, which would request the use of this CL_MEM_COPY_HOST_PTR flag:
#ifdef INTEGRATED_OPENCL
buffer_cl_
= cl::Buffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
sizeof(T) * size(), A); // this is always synchronous
#else
buffer_cl_
= cl::Buffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
sizeof(T) * size(), A); // this is always synchronous
#endif
@SteveBronder @rok-cesnovar I'm completely new to the OpenCL space so make sure to let me know if there's a better or more performant option here!
Current Version:
v4.4.0
Thanks for finding this! What if we pulled out the bitfield like
constexpr auto copy_or_share = CL_MEM_COPY_HOST_PTR * INTEGRATED_OPENCL | (CL_MEM_USE_HOST_PTR * !INTEGRATED_OPENCL)
Then I think putting that in the mem flags would work?
(I forget bitfield logic and am on phone atm but it's something like that)
Neat! I'll do some testing locally and open a PR
Your bitfield magic seems to work like a charm. I'll finish out running all of the OpenCL tests locally to make sure that the integrated implementation doesn't have any hidden issues
Is this issue supposed to be fixed?
I am experiencing a similar issue with Cmdstan 2.32.2.
Have you used the fix added in the PR? Adding INTEGRATED_OPENCL=true to make/local
Oh I see... This works now!
So there is no way to automatically detect the Integrated Graphics and switch the flag in this case?
Oh I see... This works now!
So there is no way to automatically detect the Integrated Graphics and switch the flag in this case?
Unfortunately not, this would require a bit more of a dedicated build system (e.g., a configure script) then we currently use.
But I've opened a PR to add this makeflag to the doc: https://github.com/stan-dev/docs/pull/649
I'm sorry, but I don't understand: why cannot we detect this at runtime?
It has to be detected at compile-time, not runtime - since it requires different code to be compiled
Well, it's just a different flag to pass to cl::Buffer. This can be easily changed at runtime.
Basically, this would mean keeping the same code, but replacing INTEGRATED_OPENCL with some global variable which is initialized once and for all.
Note the constexpr type for the flags, this means that it has to be known at compile time
I'm looking at the OpenCL documentation:
https://man.opencl.org/clCreateBuffer.html
It doesn't say that flags are constexpr. It's only copy_or_share, which, for some reason, has been set constexpr in this patch. There is no reason not to revert this.
Sorry, I linked to the C OpenCL library. The C++ library is the same: https://github.khronos.org/OpenCL-CLHPP/classcl_1_1_buffer.html#a66ee6c9c837b5f74f02bef8f90459b5c
(Also, another point: is there any reason to avoid the copy cost when we can? It seems like not copying will be costly as well, because the GPU will need to access these data in general RAM.)