llvm icon indicating copy to clipboard operation
llvm copied to clipboard

Create SYCL events for submissions to native queues

Open fknorr opened this issue 9 months ago • 2 comments

Is your feature request related to a problem? Please describe

SYCL allows submitting native backend operations to (in-order) queues without paying the synchronization overhead cost of a host_task through sycl::get_native(queue).

sycl::queue q(device, sycl::property::queue::in_order{});
const auto evt1 = q.submit(/* some SYCL operation */);
auto cuda_stream = sycl::get_native<sycl::backend::ext_oneapi_cuda>(q);
cudaNativeFunctionAsync(..., cuda_stream); // does not wait for evt1 on the host, only on device!
const auto evt2 = ??;

As seen above we can get a sycl::event for the SYCL operation / kernel submission, but not for the cudaNative submission. Such an event would however be desirable so that another operation (on a different queue) could specify a dependency on that exact submission, something which is not possible when manually doing cudaEventRecord.

Describe the solution you would like

Multiple ideas, in descending complexity:

  1. Support for AdaptiveCpp's enqueue_custom_operation, see docs
  2. An extension function sycl::event sycl::queue::ext_record_event() that performs the equivalent of a cudaEventRecord on an in-order queue
  3. A working implementation of sycl::make_event<backend::ext_oneapi_cuda>(cudaEvent_t, context &)
  4. A pointer on what internal function needs to be called as workaround in the meantime

Describe alternatives you have considered

I have attempted

sycl::event record_cuda_event(sycl::queue &queue) {
    const auto stream = sycl::get_native<sycl::backend::ext_oneapi_cuda>(queue);
    cudaEvent_t event;
    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
    cudaEventRecord(event, stream);
    return sycl::detail::make_event(sycl::detail::pi::cast<pi_native_handle>(event), queue.get_context(), sycl::backend::ext_oneapi_cuda);
}

but the returned event does not make progress when queried using event.get_infosycl::info::event::command_execution_status()`.

Using the official API

return sycl::make_event<sycl::backend::ext_oneapi_cuda>(event, context);

instead fails to compile with

include/sycl/backend.hpp:356:1: note: candidate template ignored: requirement 'detail::InteropFeatureSupportMap<sycl::backend::ext_oneapi_cuda>::MakeEvent == true' was not satisfied [with Backend = sycl::backend::ext_oneapi_cuda]

Additional context

Using host_task as a replacement is not desirable because it needs to wait (on the host) for the previous operations on the (in-order) queue to complete, negating the latency-hiding benefits of eagerly submitting device work in-order.

Please advise if there is any workaround using (non-portable / unstable) internal APIs at the moment to create an event from such a manual submission, or to convert (wrap) a cudaEvent_t to a sycl::event so that other in-order queues may wait on it.

Above code / workarounds was tried with DPC++ e3308557 (May 7, 2024).

fknorr avatar May 08 '24 16:05 fknorr

An extension function sycl::event sycl::queue::ext_record_event() that performs the equivalent of a cudaEventRecord on an in-order queue

Have you considered using sycl_ext_oneapi_enqueue_barrier? sycl::queue::ext_oneapi_submit_barrier() does exactly that, as far as I can tell.

al42and avatar May 17 '24 14:05 al42and

Thanks, I've looked into the implementation of ext_oneapi_submit_barrier and it appears to just return the event from the last SYCL submission that was made to the in-order queue. It does not seem to record new events meaning it is blind to my manual submission through the native queue.

I've found another extension with a promising name, sycl_ext_oneapi_in_order_queue_events which provides queue::ext_oneapi_set_external_event - but the external event is again a SYCL event and I have no way of creating one from a cudaEvent_t, so I'm back to the start.

Thinking about this a some more, I'm beginning to wonder if my example code can even be made thread-safe at all. As far as I understand DPC++ is free to perform its own native-queue submissions of kernels in a background thread and return control immediately, which means that even if CUDA streams themselves are / were thread-safe, I could experience spurious re-orders between a kernel launch in q.submit() and my own subsequent operation on the native queue.

Having an equivalent to AdaptiveCpp's enqueue_custom_operation really seems like the best option to have semantics that are both unambiguous to the user and allow the implementation to see all of the users interop work.

fknorr avatar May 19 '24 06:05 fknorr

Thanks, I've looked into the implementation of ext_oneapi_submit_barrier and it appears to just return the event from the last SYCL submission that was made to the in-order queue. It does not seem to record new events meaning it is blind to my manual submission through the native queue.

Nice catch; I missed it when this optimization was added. But perhaps queue.submit([=](handler &CGH) { CGH.ext_oneapi_barrier(); }); (which ext_oneapi_submit_barrier does on non-optimized path) would still work? The queue::submit seems to always return the new event (although I did not dig through all the layers of abstraction).

Using the official API return sycl::make_event<sycl::backend::ext_oneapi_cuda>(event, context); instead fails to compile with include/sycl/backend.hpp:356:1: note: candidate template ignored: requirement 'detail::InteropFeatureSupportMap<sycl::backend::ext_oneapi_cuda>::MakeEvent == true' was not satisfied [with Backend = sycl::backend::ext_oneapi_cuda]

Funny, there is some experimental CUDA interop header, sycl/include/sycl/ext/oneapi/experimental/backend/backend_traits_cuda.hpp, which supports events (#6202, #6288). It can be activated with -DSYCL_EXT_ONEAPI_BACKEND_CUDA_EXPERIMENTAL; not sure why is it still experimental, given it was added in 2022. It uses the same sycl::detail::make_event function internally, which sets IsRecorded to false thus short-circuiting any logic that uses the sycl::event, so that still won't solve your problem :(

al42and avatar May 27 '24 19:05 al42and

But perhaps queue.submit([=](handler &CGH) { CGH.ext_oneapi_barrier(); }); (which ext_oneapi_submit_barrier does on non-optimized path) would still work? The queue::submit seems to always return the new event (although I did not dig through all the layers of abstraction).

As far as I can tell this only creates a detail::CGBarrier from the event-wait list that's passed in. For any of the barriers to work they would have to be specialized on in-order queues in some way, because submitting a backend event for a barrier only makes sense for in-order queues.

It uses the same sycl::detail::make_event function internally, which sets IsRecorded to false thus short-circuiting any logic that uses the sycl::event, so that still won't solve your problem :(

I can confirm that this compiles but makes no progress just like calling detail::make_event manually. Is the IsRecorded property part of UR? I couldn't find it in the DPC++ sources. If I understand correctly, this means that CUDA make_event is implemented but fully useless at the moment, i.e. a bug?

fknorr avatar May 28 '24 08:05 fknorr

Is the IsRecorded property part of UR? I couldn't find it in the DPC++ sources.

Yes, it's in UR: https://github.com/oneapi-src/unified-runtime/blob/905804c2e93dd046140057fd07a5d6191063bedc/source/adapters/cuda/event.cpp#L35-L43

If I understand correctly, this means that CUDA make_event is implemented but fully useless at the moment, i.e. a bug?

Not sure it's a bug (interop is quite vaguely described in the standard), but it definitely makes the current functionality quite useless.

I guess one solution would be for UR to check the event status and, if it is "running", then set IsRecorded to true. It still will be incomplete as far as SYCL events are concerned: no way to get some information descriptors like timing/queue etc, but at least more useful than it is now.

al42and avatar May 28 '24 09:05 al42and

I guess one solution would be for UR to check the event status and, if it is "running", then set IsRecorded to true.

Either that, or expose ur_event_handle_t_::record somehow.

fknorr avatar May 28 '24 10:05 fknorr