OpenCL-CLHPP
Problem using SVMAllocator with multiple Contexts
Hi, I'm trying to use cl::SVMAllocator on two different platforms and have run into some problems. I created two cl::Context objects and constructed a cl::coarse_svm_vector from each using the following code:
#include <iostream>

#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#include <CL/opencl.hpp>

int main()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    cl::Platform platform_0 = platforms[0], platform_1 = platforms[1];

    std::vector<cl::Device> devices;
    platform_0.getDevices(CL_DEVICE_TYPE_ALL, &devices);
    cl::Device device_0 = devices[0];
    platform_1.getDevices(CL_DEVICE_TYPE_ALL, &devices);
    cl::Device device_1 = devices[0];

    cl::Context context_0(device_0), context_1(device_1);
    // cl::Context::setDefault(context_0);

    cl::SVMAllocator<int, cl::SVMTraitCoarse<>> alloc_0(context_0), alloc_1(context_1);

    const int n = 10;
    cl::coarse_svm_vector<int> vec_0(n, 0, alloc_0);
    std::cerr << "Debug 0" << std::endl;
    cl::coarse_svm_vector<int> vec_1(n, 1, alloc_1);
    std::cerr << "Debug 1" << std::endl;

    return 0;
}
I found that if I uncomment the cl::Context::setDefault(context_0) line, I get Debug 0 followed by a segmentation fault. Without cl::Context::setDefault, the program segfaults immediately.
I did some investigation and found that the segmentation fault is caused by enqueueMapSVM in cl::SVMAllocator::allocate:
#0 0x00007ffff7833424 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
#1 0x00007ffff64f2fd6 in ?? () from /usr/lib/libnvidia-opencl.so.1
#2 0x000055555555cbbf in cl::CommandQueue::enqueueMapSVM<int> (this=0x7fffffffe640, ptr=0x7fffd4400000, blocking=1, flags=3, size=40, events=0x0, event=0x0) at /usr/include/CL/opencl.hpp:8170
#3 0x000055555555c8ad in cl::enqueueMapSVM<int> (ptr=0x7fffd4400000, blocking=1, flags=3, size=40, events=0x0, event=0x0) at /usr/include/CL/opencl.hpp:9308
#4 0x000055555555c428 in cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > >::allocate (this=0x7fffffffe860, size=10) at /usr/include/CL/opencl.hpp:3715
#5 0x000055555555bc99 in std::allocator_traits<cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > > >::allocate (__a=..., __n=10) at /usr/include/c++/11.1.0/bits/alloc_traits.h:314
#6 0x000055555555b68e in std::_Vector_base<int, cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > > >::_M_allocate (this=0x7fffffffe860, __n=10) at /usr/include/c++/11.1.0/bits/stl_vector.h:346
#7 0x000055555555aebf in std::_Vector_base<int, cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > > >::_M_create_storage (this=0x7fffffffe860, __n=10) at /usr/include/c++/11.1.0/bits/stl_vector.h:361
#8 0x000055555555a299 in std::_Vector_base<int, cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > > >::_Vector_base (this=0x7fffffffe860, __n=10, __a=...) at /usr/include/c++/11.1.0/bits/stl_vector.h:305
#9 0x0000555555559057 in std::vector<int, cl::SVMAllocator<int, cl::SVMTraitCoarse<cl::SVMTraitReadWrite<cl::detail::SVMTraitNull> > > >::vector (this=0x7fffffffe860, __n=10, __value=@0x7fffffffe880: 0, __a=...) at /usr/include/c++/11.1.0/bits/stl_vector.h:524
#10 0x00005555555565c0 in main () at test.cpp:20
At line 3717 of opencl.hpp, cl::SVMAllocator::allocate calls enqueueMapSVM using the default cl::CommandQueue, regardless of which cl::Context was passed to the allocator and stored in its context_ member. I think this may be the cause of the problem:
// Line 3717
// If allocation was coarse-grained then map it
if (!(SVMTrait::getSVMMemFlags() & CL_MEM_SVM_FINE_GRAIN_BUFFER)) {
    cl_int err = enqueueMapSVM(retValue, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, size*sizeof(T));
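For context, the free cl::enqueueMapSVM overload used here takes no queue argument; as far as I can tell it fetches the default command queue internally, roughly like this (a paraphrased sketch, not the exact header code):

// Paraphrased sketch of the free overload in opencl.hpp (not verbatim): it has
// no queue parameter, so the map always goes through the default command queue,
// which is created from the default context.
template <typename T>
cl_int enqueueMapSVM_sketch(T* ptr, cl_bool blocking, cl_map_flags flags, cl::size_type size)
{
    cl_int error;
    cl::CommandQueue queue = cl::CommandQueue::getDefault(&error);
    if (error != CL_SUCCESS)
        return error;
    return queue.enqueueMapSVM(ptr, blocking, flags, size);
}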
Now I'm wondering:
- Is there any problem with using cl::SVMAllocator with a cl::Context different from the default context?
- How do I use cl::SVMAllocator with multiple contexts correctly?
Some information about my devices:
Number of platforms 2
Platform Name NVIDIA CUDA
Platform Version OpenCL 3.0 CUDA 11.4.112
Device Name NVIDIA GeForce GTX 1660
Platform Name Intel(R) CPU Runtime for OpenCL(TM) Applications
Platform Version OpenCL 2.1 LINUX
Device Name AMD Ryzen 7 3700X 8-Core Processor
Thanks in advance for any help.
I took a look at this. Here is what I think is happening:
- There are two SVM allocators with two different contexts.
- The allocation itself (clSVMAlloc) is done using the context provided by the allocator.
- Because this is a coarse-grain SVM allocator, when constructing the coarse_svm_vector the C++ bindings map the SVM allocation for access on the host.
- Mapping the SVM allocation requires a command queue (for clEnqueueSVMMap). Currently the C++ bindings use the "default" command queue to do this.
- The "default" command queue is created from the "default" context. If the "default" context doesn't exist, the C++ bindings will create it too; the default context is created against the default device in platform 0.
So:

- If there is no default context set, a third context gets created (the "default" context), and the default command queue is created from it. The svm_ptr passed to clEnqueueSVMMap has then been allocated from a different context than the one the command queue was created from, and according to the spec this is undefined behavior: "If svm_ptr is allocated using clSVMAlloc then it must be allocated from the same context from which command_queue was created. Otherwise the behavior is undefined." (A small check illustrating this mismatch is sketched after this list.)
- If the first context is set as the default context, then the contexts match for the first coarse_svm_vector, but not for the second, so there is still undefined behavior.
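For illustration, here is a hypothetical check that could be dropped into the repro's main() after the two contexts are created; context_0 and context_1 are the names from the original snippet:

// Hypothetical check: compare the default queue's context handle against each
// allocator's context handle. Note that calling getDefault() creates the
// default context and queue if they don't exist yet.
cl::CommandQueue defaultQueue = cl::CommandQueue::getDefault();
cl::Context queueContext = defaultQueue.getInfo<CL_QUEUE_CONTEXT>();
std::cerr << "default queue context matches context_0: "
          << (queueContext() == context_0()) << std::endl;
std::cerr << "default queue context matches context_1: "
          << (queueContext() == context_1()) << std::endl;
// If neither comparison is true, every clEnqueueSVMMap issued by the allocator
// is given a queue from a different context than the SVM pointer, which is the
// undefined behavior quoted above.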
Because there is just one default context and one default command queue, I don't currently see an easy way to make this case work with coarse-grain SVM allocations. Possible (not-so-easy) solutions: track a default command queue per context (or per platform?) and choose one based on the allocator's context, or create a command queue for the allocator from the allocator's context and use that instead of the default command queue.
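As a very rough illustration of that second option, here is an untested sketch (not something the bindings provide today) of an allocator that owns a queue created from its own context and maps coarse-grained allocations through that queue instead of the default one. It uses the C API directly, and PerContextSVMAllocator is a hypothetical name:

#define CL_HPP_TARGET_OPENCL_VERSION 200  // match the repro's target version
#include <CL/opencl.hpp>
#include <cstddef>
#include <new>

// Untested sketch: a coarse-grained SVM allocator that owns a command queue
// created from its own context, so clEnqueueSVMMap/clEnqueueSVMUnmap are always
// issued on a queue from the same context as the allocation.
template <typename T>
class PerContextSVMAllocator {
public:
    using value_type = T;

    explicit PerContextSVMAllocator(const cl::Context& ctx)
        : context_(ctx), queue_(ctx) {}  // queue built from the allocator's own context

    T* allocate(std::size_t n) {
        void* p = clSVMAlloc(context_(), CL_MEM_READ_WRITE,
                             n * sizeof(T), 0 /* default alignment */);
        if (!p) throw std::bad_alloc();
        // Map for host access with the per-context queue (coarse-grained SVM).
        cl_int err = clEnqueueSVMMap(queue_(), CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                                     p, n * sizeof(T), 0, nullptr, nullptr);
        if (err != CL_SUCCESS) {
            clSVMFree(context_(), p);
            throw std::bad_alloc();
        }
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
        clEnqueueSVMUnmap(queue_(), p, 0, nullptr, nullptr);
        queue_.finish();
        clSVMFree(context_(), p);
    }

    bool operator==(const PerContextSVMAllocator& other) const { return context_() == other.context_(); }
    bool operator!=(const PerContextSVMAllocator& other) const { return !(*this == other); }

private:
    cl::Context context_;
    cl::CommandQueue queue_;
};

// Usage with the two contexts from the repro:
//   std::vector<int, PerContextSVMAllocator<int>> vec_0(10, 0, PerContextSVMAllocator<int>(context_0));
//   std::vector<int, PerContextSVMAllocator<int>> vec_1(10, 1, PerContextSVMAllocator<int>(context_1));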
I am having a similar issue on a CentOS + CUDA platform with V100 and A100 GPUs, although I have only one context. Explicitly setting this context as the default did not help in my case. The bug only happens with coarse-grain buffers. However, I can fix the problem by setting the queue I use as the default one, as follows:
// Initialize OpenCL
cl::Device device = cl::Device::getDefault();
cl::Context context(device);
cl::CommandQueue queue(context, device);

// -----------THE PROGRAM SEGFAULTS IF THIS IS COMMENTED OUT-----------
cl::CommandQueue::setDefault(queue);

// Compile OpenCL program for found device
cl::Program program(context, kernel_source);
program.build(device);
cl::Kernel kernel_reduce(program, "reduce");

{
    // Set problem dimensions
    unsigned n = 10;

    // Create SVM buffer for sum
    cl::SVMAllocator<int, cl::SVMTraitReadWrite<>> svmAlloc(context);
    int *sum = svmAlloc.allocate(1);
    ...
Likely another command queue is created elsewhere if you don't set yours as the default, so your fix probably amounts to "just" having a single command queue, which has always worked. The real issue is trying to control GPUs individually, which doesn't seem to be possible with SVM, at least on Nvidia.
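For example (a hypothetical check, reusing the names from the snippet above and assuming <iostream> is available), after cl::CommandQueue::setDefault(queue) the default queue that the bindings use for the SVM map shares its context with the allocation:

// Hypothetical check (paste after cl::CommandQueue::setDefault(queue) above):
// the queue the bindings pick up for SVM mapping is the one we created, and its
// context is the same context the SVM allocator will allocate from.
cl::CommandQueue defaultQueue = cl::CommandQueue::getDefault();
bool sameQueue   = (defaultQueue() == queue());                                // same cl_command_queue
bool sameContext = (defaultQueue.getInfo<CL_QUEUE_CONTEXT>()() == context());  // same cl_context
std::cout << "same queue: " << sameQueue << ", same context: " << sameContext << std::endl;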
Nvidia's SVM support generally seems quite poor: not only are there surprise crashes like this, which the community can't feasibly inspect because of the binary blobs, but there are also questionable driver decisions that can't be overridden, and some hardware features are likely intentionally not exposed outside of CUDA. I gave up on SVM with Nvidia because the driver generally allocated memory on both the host and the card, and it ended up passing the whole contents back and forth, incurring massive penalties.
I'm personally looking forward to SYCL as a possible successor to CUDA and OpenCL that could solve such issues. OpenCL's requirements were never really high, expectations got even more lax with 3.0, and apparently a manufacturer can claim compliance with an incredibly buggy and arguably intentionally bad implementation. CUDA was always sketchy as a vendor lock-in, and supporting it is lately really expensive, with the only manufacturer pricing hardware according to what its monopoly position allows and CUDA code not working elsewhere without double the effort. I usually try to go for well-established standards, but Nvidia's support is so bad that sometimes code just needs separate CUDA support to work on Nvidia devices, while AMD is reasonably good (once the driver installation has been wrestled with), and Intel is often my primary choice for testing correctness.