OpenCL-CLHPP
Problem using SVMAllocator with multiple Contexts
Hi, I'm trying to use cl::SVMAllocator on two different platforms and have run into some problems. I created two cl::Context objects and constructed a cl::coarse_svm_vector from each using the following code:
#include <iostream>

#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#include <CL/opencl.hpp>

int main()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    cl::Platform platform_0 = platforms[0], platform_1 = platforms[1];

    std::vector<cl::Device> devices;
    platform_0.getDevices(CL_DEVICE_TYPE_ALL, &devices);
    cl::Device device_0 = devices[0];
    platform_1.getDevices(CL_DEVICE_TYPE_ALL, &devices);
    cl::Device device_1 = devices[0];

    cl::Context context_0(device_0), context_1(device_1);
    // cl::Context::setDefault(context_0);

    cl::SVMAllocator<int, cl::SVMTraitCoarse<>> alloc_0(context_0), alloc_1(context_1);

    const int n = 10;
    cl::coarse_svm_vector<int> vec_0(n, 0, alloc_0);
    std::cerr << "Debug 0" << std::endl;
    cl::coarse_svm_vector<int> vec_1(n, 1, alloc_1);
    std::cerr << "Debug 1" << std::endl;

    return 0;
}
I found that if I uncomment the cl::Context::setDefault(context_0) line, I get Debug 0 followed by a segmentation fault. Without cl::Context::setDefault, the program segfaults immediately.
I did some investigation and found that the segmentation fault is caused by enqueueMapSVM in cl::SVMAllocator::allocate:
#0 0x00007ffff7833424 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
#1 0x00007ffff64f2fd6 in ?? () from /usr/lib/libnvidia-opencl.so.1
#2 0x000055555555cbbf in cl::CommandQueue::enqueueMapSVM<int> (this=0x7fffffffe640, ptr=0x7fffd4400000, blocking=1, flags=3, size=40, events=0x0, event=0x0) at /usr/include/CL/opencl.hpp:8170
#3 0x000055555555c8ad in cl::enqueueMapSVM<int> (ptr=0x7fffd4400000, blocking=1, flags=3, size=40, events=0x0, event=0x0) at /usr/include/CL/opencl.hpp:9308
#4 0x000055555555c428 in cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > >::allocate (this=0x7fffffffe860, size=10) at /usr/include/CL/opencl.hpp:3715
#5 0x000055555555bc99 in std::allocator_traits<cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > > >::allocate (__a=..., __n=10) at /usr/include/c++/11.1.0/bits/alloc_traits.h:314
#6 0x000055555555b68e in std::_Vector_base<int, cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > > >::_M_allocate (this=0x7fffffffe860, __n=10) at /usr/include/c++/11.1.0/bits/stl_vector.h:346
#7 0x000055555555aebf in std::_Vector_base<int, cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > > >::_M_create_storage (this=0x7fffffffe860, __n=10) at /usr/include/c++/11.1.0/bits/stl_vector.h:361
#8 0x000055555555a299 in std::_Vector_base<int, cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > > >::_Vector_base (this=0x7fffffffe860, __n=10, __a=...) at /usr/include/c++/11.1.0/bits/stl_vector.h:305
#9 0x0000555555559057 in std::vector<int, cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > > >::vector (this=0x7fffffffe860, __n=10, __value=@0x7fffffffe880: 0, __a=...) at /usr/include/c++/11.1.0/bits/stl_vector.h:524
#10 0x00005555555565c0 in main () at test.cpp:20
At line 3717 of opencl.hpp, cl::SVMAllocator::allocate calls enqueueMapSVM using the default cl::CommandQueue, regardless of which cl::Context was passed to the allocator and stored in its context_ member. I think this may be the cause of the problem:
// Line 3717
// If allocation was coarse-grained then map it
if (!(SVMTrait::getSVMMemFlags() & CL_MEM_SVM_FINE_GRAIN_BUFFER)) {
    cl_int err = enqueueMapSVM(retValue, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, size*sizeof(T));
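For context, the free cl::enqueueMapSVM overload used here takes no queue argument; as far as I can tell it fetches the default command queue internally, roughly like this (a paraphrased sketch, not the exact header code):

// Paraphrased sketch of the free overload in opencl.hpp (not verbatim): it has
// no queue parameter, so the map always goes through the default command queue,
// which is created from the default context.
template <typename T>
cl_int enqueueMapSVM_sketch(T* ptr, cl_bool blocking, cl_map_flags flags, cl::size_type size)
{
    cl_int error;
    cl::CommandQueue queue = cl::CommandQueue::getDefault(&error);
    if (error != CL_SUCCESS)
        return error;
    return queue.enqueueMapSVM(ptr, blocking, flags, size);
}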
Now I'm wondering:
- Is there any problem with using cl::SVMAllocator with a cl::Context different from the default context?
- How do I use cl::SVMAllocator with multiple contexts correctly?
Some information about my devices:
Number of platforms 2
Platform Name NVIDIA CUDA
Platform Version OpenCL 3.0 CUDA 11.4.112
Device Name NVIDIA GeForce GTX 1660
Platform Name Intel(R) CPU Runtime for OpenCL(TM) Applications
Platform Version OpenCL 2.1 LINUX
Device Name AMD Ryzen 7 3700X 8-Core Processor
Thanks in advance for any help.
I took a look at this. Here is what I think is happening:
- There are two SVM allocators with two different contexts.
- The allocation itself (clSVMAlloc) is done using the context provided by the allocator.
- Because this is a coarse-grain SVM allocator, when constructing the coarse_svm_vector the C++ bindings map the SVM allocation for access on the host.
- Mapping the SVM allocation requires a command queue (for clEnqueueSVMMap). Currently the C++ bindings use the "default" command queue to do this.
- The "default" command queue is created from the "default" context. If the "default" context doesn't exist, the C++ bindings will create it too; the default context is created against the default device in platform 0.
So:

- If there is no default context set, a third context gets created (the "default" context), and the default command queue is created from it. The svm_ptr passed to clEnqueueSVMMap has then been allocated from a different context than the one the command queue was created from, and according to the spec this is undefined behavior: "If svm_ptr is allocated using clSVMAlloc then it must be allocated from the same context from which command_queue was created. Otherwise the behavior is undefined." (A small check illustrating this mismatch is sketched after this list.)
- If the first context is set as the default context, then the contexts match for the first coarse_svm_vector, but not for the second, so there is still undefined behavior.
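For illustration, here is a hypothetical check that could be dropped into the repro's main() after the two contexts are created; context_0 and context_1 are the names from the original snippet:

// Hypothetical check: compare the default queue's context handle against each
// allocator's context handle. Note that calling getDefault() creates the
// default context and queue if they don't exist yet.
cl::CommandQueue defaultQueue = cl::CommandQueue::getDefault();
cl::Context queueContext = defaultQueue.getInfo<CL_QUEUE_CONTEXT>();
std::cerr << "default queue context matches context_0: "
          << (queueContext() == context_0()) << std::endl;
std::cerr << "default queue context matches context_1: "
          << (queueContext() == context_1()) << std::endl;
// If neither comparison is true, every clEnqueueSVMMap issued by the allocator
// is given a queue from a different context than the SVM pointer, which is the
// undefined behavior quoted above.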
Because there is just one default context and one default command queue, I don't currently see an easy way to make this case work with coarse-grain SVM allocations. Possible (not-so-easy) solutions: track a default command queue per context (or per platform?) and choose one based on the allocator's context, or create a command queue for the allocator from the allocator's context and use that instead of the default command queue.
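As a very rough illustration of that second option, here is an untested sketch (not something the bindings provide today) of an allocator that owns a queue created from its own context and maps coarse-grained allocations through that queue instead of the default one. It uses the C API directly, and PerContextSVMAllocator is a hypothetical name:

#define CL_HPP_TARGET_OPENCL_VERSION 200  // match the repro's target version
#include <CL/opencl.hpp>
#include <cstddef>
#include <new>

// Untested sketch: a coarse-grained SVM allocator that owns a command queue
// created from its own context, so clEnqueueSVMMap/clEnqueueSVMUnmap are always
// issued on a queue from the same context as the allocation.
template <typename T>
class PerContextSVMAllocator {
public:
    using value_type = T;

    explicit PerContextSVMAllocator(const cl::Context& ctx)
        : context_(ctx), queue_(ctx) {}  // queue built from the allocator's own context

    T* allocate(std::size_t n) {
        void* p = clSVMAlloc(context_(), CL_MEM_READ_WRITE,
                             n * sizeof(T), 0 /* default alignment */);
        if (!p) throw std::bad_alloc();
        // Map for host access with the per-context queue (coarse-grained SVM).
        cl_int err = clEnqueueSVMMap(queue_(), CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                                     p, n * sizeof(T), 0, nullptr, nullptr);
        if (err != CL_SUCCESS) {
            clSVMFree(context_(), p);
            throw std::bad_alloc();
        }
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
        clEnqueueSVMUnmap(queue_(), p, 0, nullptr, nullptr);
        queue_.finish();
        clSVMFree(context_(), p);
    }

    bool operator==(const PerContextSVMAllocator& other) const { return context_() == other.context_(); }
    bool operator!=(const PerContextSVMAllocator& other) const { return !(*this == other); }

private:
    cl::Context context_;
    cl::CommandQueue queue_;
};

// Usage with the two contexts from the repro:
//   std::vector<int, PerContextSVMAllocator<int>> vec_0(10, 0, PerContextSVMAllocator<int>(context_0));
//   std::vector<int, PerContextSVMAllocator<int>> vec_1(10, 1, PerContextSVMAllocator<int>(context_1));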
I am having a similar issue on a CentOS + CUDA platform with V100 and A100 GPUs, although I have only one context. Explicitly setting this context as the default did not help in my case. The bug only happens with coarse-grain buffers. However, I can fix the problem by setting the queue I use as the default one, as follows:
// Initialize OpenCL
cl::Device device = cl::Device::getDefault();
cl::Context context(device);
cl::CommandQueue queue(context, device);

// -----------THE PROGRAM SEGFAULTS IF THIS IS COMMENTED OUT-----------
cl::CommandQueue::setDefault(queue);

// Compile OpenCL program for found device
cl::Program program(context, kernel_source);
program.build(device);
cl::Kernel kernel_reduce(program, "reduce");

{
    // Set problem dimensions
    unsigned n = 10;

    // Create SVM buffer for sum
    cl::SVMAllocator<int, cl::SVMTraitReadWrite<>> svmAlloc(context);
    int *sum = svmAlloc.allocate(1);
    ...
Likely another command queue is created elsewhere if you don't set yours as the default, so your fix probably amounts to "just" having a single command queue, which has always worked. The real issue is trying to control GPUs individually, which doesn't seem to be possible with SVM, at least on Nvidia.
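For example (a hypothetical check, reusing the names from the snippet above and assuming <iostream> is available), after cl::CommandQueue::setDefault(queue) the default queue that the bindings use for the SVM map shares its context with the allocation:

// Hypothetical check (paste after cl::CommandQueue::setDefault(queue) above):
// the queue the bindings pick up for SVM mapping is the one we created, and its
// context is the same context the SVM allocator will allocate from.
cl::CommandQueue defaultQueue = cl::CommandQueue::getDefault();
bool sameQueue   = (defaultQueue() == queue());                                // same cl_command_queue
bool sameContext = (defaultQueue.getInfo<CL_QUEUE_CONTEXT>()() == context());  // same cl_context
std::cout << "same queue: " << sameQueue << ", same context: " << sameContext << std::endl;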
Nvidia's SVM support generally seems quite poor: not only are there surprise crashes like this, which the community can't feasibly inspect because of the binary blobs, but there are also questionable driver decisions that can't be overridden, and some hardware features are likely intentionally not exposed outside of CUDA. I gave up on SVM with Nvidia because the driver generally allocated memory on both the host and the card, and it ended up passing the whole contents back and forth, incurring massive penalties.
I'm personally looking forward to SYCL as a possible successor to CUDA and OpenCL that could solve such issues. OpenCL's requirements were never really high, expectations got even more lax with 3.0, and apparently a manufacturer can claim compliance with an incredibly buggy and arguably intentionally bad implementation. CUDA was always sketchy as a vendor lock-in, and supporting it is lately really expensive, with the only manufacturer pricing hardware according to what its monopoly position allows and CUDA code not working elsewhere without double the effort. I usually try to go for well-established standards, but Nvidia's support is so bad that sometimes code just needs separate CUDA support to work on Nvidia devices, while AMD is reasonably good (once the driver installation has been wrestled with), and Intel is often my primary choice for testing correctness.