
[SYCL] Program reports it does not contain the kernel requested 0 (CL_SUCCESS) when using a functor on GPU

Open shiltian opened this issue 6 years ago • 9 comments

Here is a small case to reproduce:

#include <CL/sycl.hpp>

class KernelFunctor {
public:
  void operator()(cl::sycl::nd_item<1> Item) {
    const auto GID = Item.get_global_id();
  }
};

int main() {
  cl::sycl::queue Queue;
  cl::sycl::device Device = Queue.get_device();

  cl::sycl::program Prog(Queue.get_context());

  Prog.build_with_kernel_type<KernelFunctor>();
  auto Kernel = Prog.get_kernel<KernelFunctor>();

  Queue.submit([&](cl::sycl::handler &cgh){
    cgh.parallel_for(cl::sycl::nd_range<1>{16, 8}, Kernel);
  });

  return 0;
}

Compilation command:

$ clang++ -fsycl test.cpp -lOpenCL
$ SYCL_DEVICE_TYPE=GPU ./a.out
terminate called after throwing an instance of 'cl::sycl::invalid_object_error'
  what():  This instance of program does not contain the kernel requested 0 (CL_SUCCESS)
[1]    29269 abort (core dumped)  SYCL_DEVICE_TYPE=GPU ./a.out

Call stack:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff6648801 in __GI_abort () at abort.c:79
#2  0x00007ffff7ad8957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff7adeab6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff7adeaf1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff7aded24 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x0000000000413b6f in cl::sycl::detail::program_impl::get_pi_kernel (this=0x65c000, KernelName="")
    at /home/shilei/Documents/sycl/intel-llvm/deploy/lib/clang/9.0.0/include/CL/sycl/detail/program_impl.hpp:418
#7  0x0000000000413726 in cl::sycl::detail::program_impl::get_kernel<KernelFunctor> (this=0x65c000, PtrToSelf=warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<cl::sycl::detail::program_impl, std::allocator<cl::sycl::detail::program_impl>, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<cl::sycl::detail::program_impl, std::allocator<cl::sycl::detail::program_impl>, (__gnu_cxx::_Lock_policy)2>'

std::shared_ptr<cl::sycl::detail::program_impl> (use count 2, weak count 0) = {...})
    at /home/shilei/Documents/sycl/intel-llvm/deploy/lib/clang/9.0.0/include/CL/sycl/detail/program_impl.hpp:275
#8  0x000000000040ebf4 in cl::sycl::program::get_kernel<KernelFunctor> (this=0x7fffffffe268)
    at /home/shilei/Documents/sycl/intel-llvm/deploy/lib/clang/9.0.0/include/CL/sycl/program.hpp:99
#9  0x000000000040e163 in main () at test.cpp:17

shiltian avatar Sep 15 '19 19:09 shiltian

Hi @tianshilei1992, the error happens because the device compiler can't find any kernels in this SYCL code. The device compiler extracts kernels from the function objects passed as parameters to kernel invocation methods (e.g. parallel_for). Here you don't pass a function object to parallel_for but a cl::sycl::kernel object instead, so the compiler never sees the kernel body. I'd say there is no chance for the compiler to see any device code in this particular sample, because the compiler checks only kernel invocation methods. The SYCL spec seems a little unclear to me here, because I couldn't find a place where it explicitly says that you can't run a SYCL kernel without passing it to a kernel invocation method, but the following two phrases seem like clues:

The type of the function object and the program object enable the compilation and linking of the kernel in the program class, a priori of its actual invocation as a kernel object (pt 4.8.9.3).

In both single-source and shared-source implementations, a device compiler should detect the kernel invocations (e.g. parallel_for) in the source code and compile the enclosed kernels, storing them with their associated type name (pt 6.2)

Fznamznon avatar Sep 17 '19 07:09 Fznamznon

@Fznamznon Thank you for the detailed response. I didn't find any statement in the spec saying that my approach is wrong either, and I believe I'm not the only one who will write code this way.

I've tried the following code and it does work:

#include <CL/sycl.hpp>

class KernelFunctor {
public:
  void operator()(cl::sycl::nd_item<1> Item) {
    const auto GID = Item.get_global_id();
  }
};

int main() {
  cl::sycl::queue Queue;
  cl::sycl::device Device = Queue.get_device();

  cl::sycl::program Prog(Queue.get_context());

  Prog.build_with_kernel_type<KernelFunctor>();
  auto Kernel = Prog.get_kernel<KernelFunctor>();

  Queue.submit([&](cl::sycl::handler &cgh){
    KernelFunctor fn;
    cgh.parallel_for(cl::sycl::nd_range<1>{16, 8}, fn);
  });

  return 0;
}

The only difference is that I define a function object in command group scope and pass it to parallel_for instead of a cl::sycl::kernel.

shiltian avatar Sep 17 '19 18:09 shiltian

@mkinsner, @jbrodman, what do you think about it? Is the first code snippet correct from the spec's point of view?

Fznamznon avatar Sep 18 '19 08:09 Fznamznon

is no chance for the compiler to see a device code in this particular sample because the compiler checks only kernel invocation methods

Should build_with_kernel_type be added to the list of calls that identify a kernel?

mkinsner avatar Apr 23 '20 01:04 mkinsner

Who will be responsible for KernelFunctor creation? The original test case is too simple to demonstrate all the questions we might face trying to make it work. If we add an accessor-type member to the functor to make it useful, how should that member be initialized? I think we should require passing KernelFunctor to the parallel_for method and prohibit the original code.

bader avatar May 07 '20 12:05 bader

@rolandschulz and I just encountered this independently, and despite knowing about this issue it still took us a very long time to find the mistake.

#include <CL/sycl.hpp>
#include <typeinfo>
#include <cassert>
#include <cstdio> // for printf below
using namespace sycl;

template <typename T, int dimensions>
using local_accessor = accessor<T, dimensions, access::mode::read_write, access::target::local>;

int main()
{
  queue q;
  int host_data = 0;
  {
    buffer<int,1> device_data(&host_data, 1);
    q.submit([&](handler& h) {

      // Create lambda and capture everything it needs
      auto acc = device_data.get_access<access::mode::read_write>(h);
      auto l = [=](nd_item<3> it) [[cl::reqd_work_group_size(1, 1, 1)]] {
        acc[0]++;
      };

      // Build kernel to enable kernel queries
      program p(q.get_context());
      p.build_with_kernel_type<decltype(l)>();
      kernel k = p.get_kernel<decltype(l)>();

      // Query kernel; reqd_work_group_size must be respected by enqueue
      range<3> local = k.get_work_group_info<info::kernel_work_group::compile_work_group_size>(q.get_device());
      printf("compile_work_group_size = {%lu, %lu, %lu}\n", local[0], local[1], local[2]);

      // Enqueue kernel with valid size
      range<3> global{2, 2, 2};
#if 1 // this fails, because parallel_for doesn't take the lambda as an argument
      h.parallel_for(nd_range<3>(global, local), k);
#else // this works
      h.parallel_for(nd_range<3>(global, local), l);
#endif

    }).wait();
  }
  printf("host_data = %d\n", host_data);
}

I think this fails only because the kernel name doesn't get registered (and doesn't appear in the integration header) unless it is passed to parallel_for. Unfortunately this doesn't trigger a compiler error: the kernel name for the lambda appears to evaluate to the empty string, and everything using KernelInfo uses the default class (instead of an explicit specialization).

I appreciate these use-cases look a little silly, so here's something arguably a little more realistic, where a user is trying to call a kernel constructed by a library:

// Assume these are provided by a library
struct LibraryGEMMFunctor
{
  void operator()(nd_item<3> it) {}
};
kernel get_gemm_kernel(queue& q) {
  program p(q.get_context());
  p.build_with_kernel_type<struct LibraryGEMMFunctor>();
  return p.get_kernel<struct LibraryGEMMFunctor>();
}

int main() {
  queue q;
  kernel gemm = get_gemm_kernel(q);
  q.submit([&](handler& h) {
    range<3> global{2, 2, 2};
    range<3> local{1, 1, 1};
    h.parallel_for(nd_range<3>(global, local), gemm);
  }).wait();
}

I don't think that it's obvious why this doesn't work, and it seems unnatural to force library writers to call parallel_for somewhere in their library.

That said, I do agree with @bader that this raises other interesting questions. If a user constructs a sycl::kernel via OpenCL interoperability, there's a clear mechanism to set its arguments and call it via parallel_for. If a user constructs a sycl::kernel from anything else, setting its arguments seems impossible?

Pennycook avatar May 27 '20 21:05 Pennycook

@Pennycook asked me how SYCL 2020 will handle this with the new Module API. I answered him in email, but I can also add some information here about the SYCL 1.2.1 implementation. As others have noted above, the SYCL 1.2.1 spec is very unclear about how the program object should be used to compile a kernel. In fact, it was only recently that I think I came to understand the intent.

First a disclaimer, the information below is based on my understanding of the spec and related documents. I haven't tried running any of this on DPC++, so I don't know if it actually works. I should also note that this area of SYCL is changing entirely in SYCL 2020, so I'm not sure how valuable it is to test the 1.2.1 implementation.

I don't think any of the code snippets above are using the program API correctly. Correct usage requires calling a form of parallel_for that accepts both a lambda function and a kernel object. This form of parallel_for is illustrated in an example in SYCL 1.2.1 section 4.8.9.3 "Defining kernels using program objects".

class MyKernel; // Forward declaration of the name of the lambda functor

cl::sycl::queue myQueue;
cl::sycl::program MyProgram(myQueue.get_context());

/* use the name of the kernel to obtain the associated program */
MyProgram.build_from_name<MyKernel>();

myQueue.submit([&](handler& commandGroup) {
  commandGroup.parallel_for<class MyKernel>(
    cl::sycl::nd_range<2>(cl::sycl::range<2>(4, 4), cl::sycl::range<2>(1, 1)),
    MyProgram.get_kernel<MyKernel>(), // execute the kernel as compiled in MyProgram
    ([=](cl::sycl::nd_item<2> index) {
      //[kernel code]
    }));
});

Notice, in particular, that the second to last parameter to parallel_for is a kernel object (the return from MyProgram.get_kernel()). This form of parallel_for is not documented anywhere else in the spec, which is clearly an oversight. When I look at the DPC++ headers, there is a similar declaration of parallel_for, but the order of the nd_range and kernel parameters is reversed.

I believe this form of parallel_for solves the problem @bader points out about passing arguments. Since the device compiler sees both the lambda and the kernel object, it knows how to pass arguments to the kernel. There is no need for the application to set any parameter values explicitly.

You may wonder why SYCL 1.2.1 also defines forms of parallel_for that take a kernel parameter without a lambda. I believe these forms can only be used by applications that create kernels using OpenCL. Such an application would need to set the kernel parameter values explicitly via the handler::set_arg() or handler::set_args() APIs.
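A minimal sketch of that interop pattern, under the assumption that the application has already built a valid cl_kernel for the same context (the function and parameter names here are hypothetical, not from the spec):

```cpp
#include <CL/sycl.hpp>

// Hypothetical helper: invoke an OpenCL-created kernel through SYCL 1.2.1.
// Assumes `oclKernel` is a valid cl_kernel whose first argument is a buffer.
void run_interop_kernel(cl::sycl::queue &Queue, cl_kernel oclKernel,
                        cl::sycl::buffer<int, 1> &Buf) {
  // Wrap the OpenCL kernel in a SYCL kernel object (SYCL 1.2.1 interop ctor).
  cl::sycl::kernel SyclKernel(oclKernel, Queue.get_context());

  Queue.submit([&](cl::sycl::handler &cgh) {
    auto Acc = Buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    // No lambda is available, so arguments must be bound explicitly.
    cgh.set_args(Acc);
    // The kernel-only parallel_for form is usable here precisely because
    // the kernel came from OpenCL, not from the device compiler.
    cgh.parallel_for(cl::sycl::nd_range<1>{16, 8}, SyclKernel);
  });
}
```

This is only a sketch of the intended usage, not a tested program; the point is that set_args takes the place of the lambda's captures.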

gmlueck avatar May 28 '20 14:05 gmlueck

While the sycl::program class is going away, I believe the same problem exists with sycl::kernel_bundle as well. @gmlueck Could you please clarify if anything is going to be changed in the spec?

romanovvlad avatar Sep 07 '22 13:09 romanovvlad

@romanovvlad, can you say more about what problem exists with sycl::kernel_bundle and what is unclear in the SYCL 2020 spec?

There are several examples in the SYCL 2020 spec showing how to use kernel bundles. For example, this one shows how to get a kernel bundle for a specific kernel and then query information about that kernel:

https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_kernel_introspection

As a side effect, that example also pre-builds the kernel prior to submitting it to the queue, which is similar to the original example in this issue.
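For comparison with the original example, a hedged sketch of the SYCL 2020 pattern that spec section describes: pre-build a bundle containing the kernel, query it, then submit the lambda itself (the kernel name MyKernel and the work-group query are illustrative):

```cpp
#include <sycl/sycl.hpp>

class MyKernel; // forward declaration of the kernel name

void run(sycl::queue &q) {
  sycl::context ctx = q.get_context();

  // Pre-build an executable bundle containing MyKernel.
  auto bundle =
      sycl::get_kernel_bundle<MyKernel, sycl::bundle_state::executable>(ctx);
  sycl::kernel k = bundle.get_kernel(sycl::get_kernel_id<MyKernel>());

  // Kernel introspection works on the pre-built kernel object.
  size_t maxWG =
      k.get_info<sycl::info::kernel_device_specific::work_group_size>(
          q.get_device());
  (void)maxWG;

  q.submit([&](sycl::handler &cgh) {
    // Bind the queried bundle so the submitted kernel matches the query.
    cgh.use_kernel_bundle(bundle);
    // Note: the lambda, not the kernel object, is what parallel_for takes.
    cgh.parallel_for<MyKernel>(sycl::nd_range<1>{16, 8},
                               [=](sycl::nd_item<1>) { /* kernel code */ });
  });
}
```

Again a sketch rather than a tested program, but it illustrates the key difference from SYCL 1.2.1: the kernel object is used for queries and bundle binding, while the device code still reaches the compiler through the invocation function.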

gmlueck avatar Sep 07 '22 13:09 gmlueck

I think this issue is no longer relevant with SYCL 2020. From 4.9.4.2, "SYCL functions for invoking kernels":

void parallel_for(nd_range<Dimensions> executionRange,
                 const kernel& kernelObject)

This function must only be used to invoke a kernel that was constructed using a backend specific interoperability function or to invoke a device built-in kernel. Attempting to use this function to invoke other kernels throws a synchronous exception with the errc::invalid error code. The precise semantics of this function are defined by each SYCL backend specification, but the intent is that the kernel should be invoked for the specified executionRange.

Throws an exception with the errc::nd_range error code if the global size defined in the associated executionRange defines a non-zero index space which is not evenly divisible by the local size in each dimension.

This invocation function ignores any kernel_bundle that was bound to this command group handler via handler::use_kernel_bundle() and instead implicitly uses the kernel bundle that contains the kernelObject. Throws an exception with the errc::kernel_not_supported error code if the kernelObject is not compatible with either the device associated with the primary queue of the command group or with the device associated with the secondary queue (if specified).

In the original example above, the kernel object is constructed through the program class, which has now been removed. In the spec example mentioned by @gmlueck, the kernel object is constructed using a kernel_bundle, but it can't be passed into parallel_for according to the spec.

Therefore, I suggest that we close this issue.

Note: I've tried passing it anyway, simply to check what happens in our implementation, i.e. doing something like this:

kernel myKernel = myBundle.get_kernel(kernelId);
// ...
myQueue.submit([&](handler& cgh) {
  // Use the kernel bundle we queried, so we are sure the queried work-group
  // size matches the kernel we run.
  cgh.use_kernel_bundle(myBundle);
  cgh.parallel_for<MyKernel>(myRange, myKernel);
});

And I hit a bunch of errors saying that myKernel is not device copyable, which makes me think we don't fully support the aforementioned parallel_for overload. But that's a separate issue.

AlexeySachkov avatar Feb 06 '23 14:02 AlexeySachkov