vexcl Errors using CL_DEVICE_TYPE_CPU on Mac OSX Yosemite

Running compiler_bug.cpp from the command line using the following:

g++ -o compiler_bug compiler_bug.cpp -std=c++0x -I OpenHeaders -framework OpenCL && ./compiler_bug

gives:

Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz
OpenCL compilation error
CVMS_ERROR_COMPILER_FAILURE: CVMS compiler has crashed or hung building an element.
clBuildProgram

Running stencil operators on the CPU also causes crashes. The same code runs on the GPU.

I'm testing this on a MacBook Pro with an i7 CPU and an ATI Radeon HD 6750M GPU.

Jan 29 '15 11:01 lajash

Can you share the source of compiler_bug.cpp here?

Jan 29 '15 11:01 ddemidov

Hi Denis,

It's the one from your gist - https://gist.github.com/ddemidov/8681608

Thanks, Rajesh

Jan 29 '15 11:01 lajash

In that case it's a known issue (see #92, https://github.com/ddemidov/vexcl/issues/92). That issue has never been resolved as far as I know. I would raise a ticket with Apple support, since the code in the gist is standard OpenCL and should work.

Edit: Also, I don't have access to a MacOSX machine, so there is not much I can do here.

Jan 29 '15 11:01 ddemidov

Ah ok.. I’ve looked at #92. Will change the filter to CL_DEVICE_TYPE_GPU for the Mac platform for now. Not sure if this applies to the new Macs though.
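A minimal sketch of that workaround, choosing the device type at compile time so that only Apple builds fall back to the GPU. The helper name `preferred_device_type` is hypothetical. The two constants are redefined here, with the values they have in the OpenCL headers, only so the snippet builds without OpenCL installed; in real code they come from the OpenCL headers.

```cpp
#include <cassert>
#include <cstdint>

// Fallback definitions so the sketch compiles without the OpenCL headers.
// The values match those in CL/cl.h; with OpenCL included these are skipped.
#ifndef CL_DEVICE_TYPE_CPU
#  define CL_DEVICE_TYPE_CPU (1 << 1)
#endif
#ifndef CL_DEVICE_TYPE_GPU
#  define CL_DEVICE_TYPE_GPU (1 << 2)
#endif

// Hypothetical helper (not part of VexCL): avoid the CPU device on Apple
// platforms, where kernel compilation crashes, and keep the CPU elsewhere.
inline std::uint64_t preferred_device_type() {
#if defined(__APPLE__)
    return CL_DEVICE_TYPE_GPU;
#else
    return CL_DEVICE_TYPE_CPU;
#endif
}
```

The returned value could then be passed to vex::Filter::Type(...) in place of a hard-coded CL_DEVICE_TYPE_CPU.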

Thanks for your help,

Rajesh.

Jan 29 '15 11:01 lajash

Closing this issue as it seems to be a bug in the Apple OpenCL framework. FYI, the 10.10.2 update also does not fix this.

Jan 29 '15 13:01 lajash

Thank you for the info! If you are able to find a workaround, I would be glad to either introduce it to vexcl or accept a pull request.

Jan 29 '15 13:01 ddemidov

Looking into it… will keep you updated.

Jan 29 '15 13:01 lajash

Denis, interestingly, the default programs provided in the Xcode samples all run fine on the CPU. I just looked at the compiler and program options and don't see anything special there either.

Could it be workgroups / queues related ? Never mind, will check it myself without bothering you.

Jan 29 '15 13:01 lajash

I think Apple's OpenCL implementation does not support workgroups of more than one workitem on CPUs (vexcl uses this restriction for kernels on CPU devices), but in the gist the kernel is never launched since it fails the compilation step.

Jan 29 '15 13:01 ddemidov

Here's what works on the CPU thus far ... (taken from your examples, of course )

#include <iostream>
#include <vector>
#include <string>
#include <stdexcept>

#define __CL_ENABLE_EXCEPTIONS
#include <vexcl/vexcl.hpp>

//---------------------------------------------------------------------------
int main() 
{
    const size_t n = 1024 * 1024;
    vex::Context ctx( vex::Filter::Type(CL_DEVICE_TYPE_CPU) );

    std::vector<double> a(n, 1.0);
    std::vector<double> c(n, 0.5);

    std::vector<double> results(n);

    vex::vector<double> A(ctx.queue(), a);
    vex::vector<double> B(ctx.queue(), n);
    B = 1.0; // B is allocated uninitialized, so give it a defined value before use
    vex::vector<double> C(ctx.queue(), c);

    A = (B + C) / 5;
    vex::copy(A, results);
    std::cout << "Size: " << results.size() << std::endl;
    std::cout << "First 2: " << results[0] << "," << results[1] << std::endl;

    VEX_FUNCTION(double, squared_radius, (double, x)(double, y),
    return x * x + y * y;
    );

    A = sqrt(squared_radius(B, C));
    vex::copy(A, results);
    std::cout << "Size: " << results.size() << std::endl;
    std::cout << "First 2: " << results[0] << "," << results[1] << std::endl;    

    VEX_STENCIL_OPERATOR(S, /*return type:*/double, /*window width:*/3, /*center:*/1,
    "return sin(X[0] - X[-1]) + sin(X[1] - X[0]);", ctx);
    A = S(A);

    vex::copy(A, results);
    std::cout << "Size: " << results.size() << std::endl;
    std::cout << "First 2: " << results[0] << "," << results[1] << std::endl;    
}

I'm going to try writing a few custom kernels that use the stencil window to see if I can break anything. Let me know if this makes any sense.
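To check the stencil output copied back from the device, a host-side reference can help. The sketch below reimplements the stencil body from the program above on a std::vector. The names `stencil_reference` and `max_abs_diff` are hypothetical, and the boundary handling (clamping out-of-range neighbours to the nearest element) is an assumption that should be checked against the VexCL documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference for the stencil body used above:
//   sin(X[0] - X[-1]) + sin(X[1] - X[0])
// ASSUMPTION: neighbours outside the vector are clamped to the nearest
// boundary element; verify this against the VexCL documentation.
std::vector<double> stencil_reference(const std::vector<double>& x) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    std::vector<double> y(x.size());
    auto at = [&](std::ptrdiff_t i) {
        i = std::min(std::max<std::ptrdiff_t>(i, 0), n - 1);
        return x[static_cast<std::size_t>(i)];
    };
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        y[static_cast<std::size_t>(i)] =
            std::sin(at(i) - at(i - 1)) + std::sin(at(i + 1) - at(i));
    }
    return y;
}

// Largest absolute difference between two equally sized result vectors.
double max_abs_diff(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d = std::max(d, std::fabs(a[i] - b[i]));
    return d;
}
```

Comparing `max_abs_diff(results, stencil_reference(input))` against a small tolerance would flag a kernel that compiles but computes something different on the CPU device.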

Jan 29 '15 15:01 lajash

I think it would make more sense to run the unit tests distributed with vexcl. You can do this with

cd $VEXCL_ROOT
mkdir build
cd build
cmake ..
make
OCL_DEVICE=i7 VEXCL_SHOW_KERNELS=1 make test

After that the test log may be found at Testing/Temporary/LastTest.log. If you upload it to e.g. gist.github.com, we could concentrate on the failing tests.

Jan 29 '15 17:01 ddemidov

Hi Denis, here goes ... https://gist.github.com/lajash/59d9a2f489d2aa05f1e9

Jan 30 '15 06:01 lajash

So the failing kernels I can see are:

In vector/multivector arithmetics:

#if defined(cl_khr_fp64)
#  pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
#  pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif

kernel void vexcl_vector_kernel
(
  ulong n,
  global double * prm_1
)
{
  ulong chunk_size  = (n + get_global_size(0) - 1) / get_global_size(0);
  ulong chunk_start = get_global_id(0) * chunk_size;
  ulong chunk_end   = chunk_start + chunk_size;
  if (n < chunk_end) chunk_end = n;
  for(ulong idx = chunk_start; idx < chunk_end; ++idx)
  {
    prm_1[idx] = 42;
  }
}

This is the kernel from the gist above. Btw, I've had another idea worth testing about this kernel, see below.

Boost.Compute integration example (sort function call), which fails due to a wrong workgroup size (Apple only supports workgroups with a single work-item on CPUs). I would run the unit tests from boost.compute and report any failures to @kylelutz (https://github.com/kylelutz/compute/issues/new).
FFT test, which could also be due to a wrong workgroup size. I'll see if using a workgroup with a single work-item makes any sense there.

About the failing assignment test: it occurred to me this could be due to a type mismatch: prm_1[idx] is double, and 42 is int. Could you please try to compile and run the code from the gist again? I have updated it to use 42.0 instead of 42.
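The chunking logic in the failing kernel can be emulated on the host to confirm that the index arithmetic itself is sound. This is only a sketch: the function name `chunk_write_counts` is hypothetical, and the loop over `gid` stands in for the work-items that OpenCL would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the kernel's chunking: each of `global_size`
// work-items handles one contiguous chunk of the n elements.
// Returns how many times each element was written.
std::vector<int> chunk_write_counts(std::size_t n, std::size_t global_size) {
    assert(global_size > 0);
    std::vector<int> counts(n, 0);
    const std::size_t chunk_size = (n + global_size - 1) / global_size;
    for (std::size_t gid = 0; gid < global_size; ++gid) {
        const std::size_t chunk_start = gid * chunk_size;
        std::size_t chunk_end = chunk_start + chunk_size;
        if (n < chunk_end) chunk_end = n;  // same clamp as in the kernel
        for (std::size_t idx = chunk_start; idx < chunk_end; ++idx)
            ++counts[idx];
    }
    return counts;
}
```

Every element should end up with a count of exactly one, including when there are more work-items than elements. That is consistent with the view in this thread that the kernel source is fine and the failure lies in the compiler.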

Jan 30 '15 07:01 ddemidov

Will do … will get back to you shortly…

Jan 30 '15 07:01 lajash

After using the updated compiler_bug.cpp from your gist, I get:

Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz
OpenCL compilation error
CVMS_ERROR_SERVICE_FAILURE: CVMS compiler has crashed or hung managing the service.
clBuildProgram

So basically, it makes no difference.

Jan 30 '15 07:01 lajash

I don't see anything wrong with this kernel, and it does work with any other OpenCL platform I have access to. I believe there is nothing left to do here but open an issue with Apple support.

Regarding the FFT issue: it does work correctly when the workgroup size is set to 1. So could you please check if the fft tests are passing for you with branch issue-158-fft?

Note however that (according to examples/fft_benchmark.cpp) VexCL's implementation of FFT is about two orders of magnitude slower than fftw on a CPU, so there is probably no reason to use it with a CPU anyway.

Jan 30 '15 07:01 ddemidov

You're right .... there's an issue with the amount of data being transferred. See this gist https://gist.github.com/lajash/1645b473676633b35d9e

It's a clEnqueueNDRangeKernel issue with a larger dataset.

Jan 30 '15 08:01 lajash

Checking fft now ... you're right, makes no sense to use vex::fft if no GPU involved. Will you be working on optimizing it in the future ?

Jan 30 '15 08:01 lajash

Re fft optimization: I don't think it makes sense when fftw is available. On a CPU one can just map the device memory to a host pointer and then use fftw (or any other host-side algorithm) on device vectors (see the example here). Also, the FFT implementation was provided by @neapel, so he could probably chime in here.

Jan 30 '15 08:01 ddemidov

New test log added here.... https://gist.github.com/lajash/991c1bd6a1fc9d3ffa95

Doesn't look like it fixed anything though. :(

Jan 30 '15 08:01 lajash

Quick update ... the code at https://gist.github.com/lajash/1645b473676633b35d9e runs on my Mac now ... just tried running it multiple times and voila, it runs in 1 out of 4 tries ... but it's extremely slow !! So it looks like it's Apple's ICD that may have an issue with the CPU. This works on all other platforms, I presume?

Jan 30 '15 08:01 lajash

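Your result vector (https://gist.github.com/lajash/1645b473676633b35d9e#file-stdev-cpp-L18) is 100 times smaller than A, so you should get an out-of-bounds error and a segfault here: https://gist.github.com/lajash/1645b473676633b35d9e#file-stdev-cpp-L55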
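One way to catch such a mismatch early on the host side is to check sizes before copying. The sketch below assumes both ends are std::vector objects; `checked_copy` is a hypothetical helper and not part of VexCL.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of VexCL): refuse to copy between vectors
// of different sizes instead of writing past the end of the destination.
template <class T>
void checked_copy(const std::vector<T>& src, std::vector<T>& dst) {
    if (src.size() != dst.size())
        throw std::length_error("checked_copy: source and destination sizes differ");
    std::copy(src.begin(), src.end(), dst.begin());
}
```

With a guard like this, a result vector that is 100 times too small produces an exception at the copy rather than an intermittent crash.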

Stencil operators use a slow path when run on a CPU, which may be significant for wide stencils like yours. The function-based variant (https://gist.github.com/ddemidov/4c126b012e4ebf669b51#file-stdev-cpp-L64-L83) works 1.5 times faster on my CPU than the stencil-based one, and only slightly slower (about 15%) on a GPU.

Jan 30 '15 09:01 ddemidov

Thanks Denis …. will use function variants …

Jan 30 '15 10:01 lajash
