[SYCL][COMPAT] Launch kernels using the enqueue functions extensions

Open AD2605 opened this issue 1 year ago • 7 comments

To support launching kernels with compile-time-known kernel properties and runtime- or compile-time-known launch properties, this PR adds new launch overloads in a new syclcompat::experimental namespace, making use of the following three extensions (a rough usage sketch follows the list):

  • SYCL_EXT_ONEAPI_KERNEL_PROPERTIES
  • SYCL_EXT_ONEAPI_PROPERTIES
  • SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS
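
For context, a rough sketch of how these three extensions compose at the plain SYCL level (the names submit, nd_launch, launch_config, properties_tag and work_group_size are as I understand them from the extension documents; the exact syclcompat::experimental overload signatures are defined by this PR and not reproduced here):

    #include <sycl/sycl.hpp>

    namespace syclex = sycl::ext::oneapi::experimental;

    // Compile-time kernel properties (SYCL_EXT_ONEAPI_KERNEL_PROPERTIES) are
    // attached to the kernel functor through a get(properties_tag) member.
    struct MyKernel {
      auto get(syclex::properties_tag) const {
        return syclex::properties{syclex::work_group_size<256>};
      }
      void operator()(sycl::nd_item<1>) const { /* kernel body */ }
    };

    void run(sycl::queue q) {
      // Free-function enqueue (SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS); launch
      // properties (SYCL_EXT_ONEAPI_PROPERTIES) would be supplied as a second
      // argument to launch_config.
      syclex::submit(q, [&](sycl::handler &cgh) {
        syclex::nd_launch(cgh,
                          syclex::launch_config(sycl::nd_range<1>{1024, 256}),
                          MyKernel{});
      });
    }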

AD2605 avatar May 02 '24 15:05 AD2605

@AD2605 thanks a lot for this contribution. It's a useful addition, and it paves the way for eventually incorporating kernel_properties into the main launch APIs. Detailed review to follow, but for now can I suggest that the code in launch_experimental.hpp could just be incorporated into launch.hpp directly, within the syclcompat::experimental namespace?
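
i.e. something like this (a sketch of the suggested file layout only, not the final contents):

    // launch.hpp
    namespace syclcompat {

    // ... existing launch()/parallel_for() overloads ...

    namespace experimental {
    // the overloads currently in launch_experimental.hpp would move here
    } // namespace experimental

    } // namespace syclcompat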

joeatodd avatar May 06 '24 08:05 joeatodd

Alternatively, if you are keen to introduce all this functionality, we can do so, so long as it's tested, and on the understanding that the API might change once we've reviewed the launch API in general.

joeatodd avatar May 06 '24 14:05 joeatodd

I think there needs to be some kind of specialization that calls a new Unified Runtime function, from the new UR CUDA adapter extension I'm adding, which calls cudaLaunchKernelExC with the cluster dimensions. I'm not sure whether you've already added this somewhere in this code?

e.g. is there / do you plan to add a specialization of launch/parallel_for that can specialize for the

properties  cluster_launch_property{ClusterRange(1, 2, 1)};

argument that you have here: https://github.com/intel/llvm/pull/13594/files#diff-96a41bacbe4aca8737244a37e62f63c18fccd2274588d37c26ca421f2fb857a0R140

Thanks

JackAKirk avatar May 13 '24 18:05 JackAKirk

Hi @JackAKirk , thanks for having a look at this PR.

I did a little digging after your comment (I have not looked into implementing the UR side). A parallel_for overload already exists here which accepts the property list and then calls the overloaded parallel_for_impl. There, we can check whether the property list contains the ClusterRange property and then call the UR function you are adding.

This would also mean one could launch a kernel with a cluster configuration like:

cgh.parallel_for(nd_range(...), sycl::ext::oneapi::experimental::properties{ClusterRange(...)}, [=](sycl::nd_item<Dim>) { /* ... */ });

I did not know this parallel_for overload existed. What I do not see, however, is the overloads introduced in sycl_ext_oneapi_enqueue_functions calling this overload when properties are supplied, and even the added tests do not seem to cover launching with properties? (https://github.com/intel/llvm/pull/13512/files#diff-f6b7355d29c87088898f102554c5a82ed290c8261ab55c0c06adb3af7a9ac932)

But yes, to answer your question: a new overload will not be required, just a specialization of parallel_for_impl which accepts the properties, and possibly a bug fix in sycl_ext_oneapi_enqueue_functions?
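
A rough sketch of the kind of check described above (ClusterRange is the property proposed in this PR, parallel_for_impl stands in for the existing implementation detail, and has_property/get_property are the compile-time queries from SYCL_EXT_ONEAPI_PROPERTIES):

    #include <sycl/sycl.hpp>

    template <int Dim, typename Props, typename KernelT>
    void parallel_for_impl(sycl::handler &cgh, sycl::nd_range<Dim> range,
                           Props props, KernelT kernel) {
      if constexpr (Props::template has_property<ClusterRange>()) {
        // A cluster size was requested: extract it and route the launch to
        // the new cluster-aware UR entry point instead of the regular path.
        auto cluster = props.template get_property<ClusterRange>();
        (void)cluster;
        // ... cluster-aware launch path ...
      } else {
        cgh.parallel_for(range, kernel);
      }
    }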

AD2605 avatar May 14 '24 06:05 AD2605

@JackAKirk, we're planning to overhaul the launch API prior to the 2025.0 release, largely so that it can accept whatever kernel and launch properties the user might specify in some kind of struct. So, assuming the cluster_launch_property can be used similarly to other launch properties, this shouldn't be a problem.
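
Purely as an illustration of "some kind of struct" (all names here are hypothetical, not a committed API):

    #include <sycl/sycl.hpp>

    // Hypothetical configuration struct bundling the range with kernel and
    // launch properties; the real API may look quite different.
    template <int Dim, typename KernelProps, typename LaunchProps>
    struct launch_policy {
      sycl::nd_range<Dim> range;
      KernelProps kernel_props;  // e.g. from SYCL_EXT_ONEAPI_KERNEL_PROPERTIES
      LaunchProps launch_props;  // e.g. a cluster_launch_property
    };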

joeatodd avatar May 14 '24 07:05 joeatodd

It looks like the most natural way to plumb this through to UR would be to follow what happens for cooperative kernels: add a bool, e.g. MImpl->MKernelIsCustom, similar to MImpl->MKernelIsCooperative (https://github.com/intel/llvm/blob/af65855fa6b6df0eded078bd3dbe3bf4a6a2b2e3/sycl/source/handler.cpp#L311), along with the additional kernel parameters. This logic eventually makes its way to this function (https://github.com/intel/llvm/blob/af65855fa6b6df0eded078bd3dbe3bf4a6a2b2e3/sycl/source/detail/scheduler/commands.cpp#L2369), where it is used to call the appropriate PI wrapper function, e.g. piEnqueueKernelLaunch for the UR kernel launch function urEnqueueKernelLaunch. I will be making an extension for a new UR function, e.g. urEnqueueKernelLaunchCustom, that calls cuLaunchKernelEx in the CUDA adapter. There needs to be logic like I described above to distinguish when a cluster size is passed, so that urEnqueueKernelLaunchCustom is called instead, similar to how the MKernelIsCooperative bool is currently used.
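
Roughly, the dispatch at the bottom of that call chain would gain one more branch (MKernelIsCustom and urEnqueueKernelLaunchCustom are the proposed additions; the code below is a simplified, hypothetical stand-in for the real plugin-call logic in commands.cpp):

    // Hypothetical, simplified version of the branch that picks the launch
    // entry point.
    enum class LaunchKind { Regular, Cooperative, Custom };

    LaunchKind selectLaunchKind(bool KernelIsCooperative, bool KernelIsCustom) {
      if (KernelIsCooperative)
        return LaunchKind::Cooperative;  // -> urEnqueueCooperativeKernelLaunch
      if (KernelIsCustom)
        return LaunchKind::Custom;       // -> proposed urEnqueueKernelLaunchCustom
                                         //    (cuLaunchKernelEx in the CUDA adapter)
      return LaunchKind::Regular;        // -> urEnqueueKernelLaunch
    }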

JackAKirk avatar May 14 '24 09:05 JackAKirk

One question I had was whether you can have cooperative kernels and set a launch-time cluster size at the same time. It turns out that you can. Whilst their interfaces are quite different, functionally cuLaunchCooperativeKernel is a subset of cuLaunchKernelEx. I think this is possibly going to provide a natural resolution of the issues I described above:

  • I imagine the intention of the CUDA API is that cuLaunchKernelEx will replace cuLaunchCooperativeKernel going forward, as a more general and extensible method of providing launch-time configuration for cooperative kernels, custom distributed shared memory "cluster group" configuration, or anything else.
  • We should sync with other backend stakeholders ASAP, but I think in turn it would be natural if the new UR API that maps to cuLaunchKernelEx could also eventually replace urEnqueueCooperativeKernelLaunch.

This would then resolve the issues raised, because all backends could switch to using the new "launch-time-kernel" UR interface that I will add, and the DPC++ logic could generalize the MImpl->MKernelIsCooperative bool to something more general and appropriately named, e.g. MImpl->MKernelIsLaunchTimeConfig.
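
For reference, a minimal CUDA driver API sketch (error handling omitted, dimensions are just example values) showing that one cuLaunchKernelEx call can carry both a cluster dimension and the cooperative flag as launch attributes:

    #include <cuda.h>

    void launch_with_cluster(CUfunction kernel, void **kernelParams, CUstream stream) {
      // Launch-time attributes: cluster dimensions and the cooperative flag
      // are combined in a single cuLaunchKernelEx call.
      CUlaunchAttribute attrs[2];
      attrs[0].id = CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION;
      attrs[0].value.clusterDim.x = 2;
      attrs[0].value.clusterDim.y = 1;
      attrs[0].value.clusterDim.z = 1;
      attrs[1].id = CU_LAUNCH_ATTRIBUTE_COOPERATIVE;
      attrs[1].value.cooperative = 1;

      CUlaunchConfig config = {};
      config.gridDimX = 8;     // example grid: 8 blocks of 128 threads
      config.gridDimY = config.gridDimZ = 1;
      config.blockDimX = 128;
      config.blockDimY = config.blockDimZ = 1;
      config.hStream = stream;
      config.attrs = attrs;
      config.numAttrs = 2;

      cuLaunchKernelEx(&config, kernel, kernelParams, /*extra=*/nullptr);
    }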

JackAKirk avatar May 14 '24 10:05 JackAKirk

Closing this for now as we went another way.

joeatodd avatar Jun 05 '24 09:06 joeatodd