VkFFT
VkFFT copied to clipboard
Find a way to set OpenCL events
In the recent OpenCL backend, the kernel is enqueued with:
result = clEnqueueNDRangeKernel(app->configuration.commandQueue[0], axis->kernel, 3, 0, global_work_size, local_work_size, 0, 0, 0);
i.e. the last three arguments are set to 0. It would be nice to set/have access to these event arguments, which are helpful for synchronization and profiling.
@DTolm, I was thinking this might be handled best by adding num_events_in_wait_list
, event_wait_list
, and event
members to VkFFTLaunchParams
, following the API. The first two denote the events that the first kernel should wait on (i.e., should only be passed at the first kernel invocation). The last one should be passed to the final kernel invocation so that the user can query its status after that last kernel asynchronously returns.
Along these lines, I think it would be best practice to propagate events between each kernel invocation (within a single transform) to ensure synchronization. I'm not familiar with the internals of VkFFT, but one could imagine an out-of-order queue launching a kernel too early.
If this sounds appropriate to you and it would be helpful, I would be willing to work on a PR. Should be pretty simple.
@zachjweiner I don't have much experience with OpenCL events, but they should be similar to what Vulkan has with synchronization when commandBuffer is created, but for kernel dispatches. You can do this, or wait until I finish another code reorganization and split it into multiple headers, it should be easier to understand then.
Taking a closer look, I see now that dispatchEnhanced
isn't passed launchParams
. I suppose the "right" thing to do (based on a cursory look at the general internal workflow of VkFFTAppend
) would be to set
app->configuration.num_events_in_wait_list = launchParams->num_events_in_wait_list;
app->configuration.event_wait_list = launchParams->event_wait_list;
app->configuration.event = launchParams->event;
(and as usual ensure these are zero unless the user sets them). On the first kernel invocation do the equivalent of
result = clEnqueueNDRangeKernel(app->configuration.commandQueue[0], axis->kernel, 3, 0, global_work_size, local_work_size, app->configuration.num_events_in_wait_list, app->configuration.event_wait_list, 0);
and, on the last,
result = clEnqueueNDRangeKernel(app->configuration.commandQueue[0], axis->kernel, 3, 0, global_work_size, local_work_size, 0, 0, app->configuration.event);
Since VkFFTSync
is a no-op on the OpenCL backend (which is good), VkFFTAppend
should release asynchronously with the execution of the final kernel, letting the user wait on their launchParams->event
. (Right now, one technically has to call clFinish
to ensure that kernel finishes before they use the buffers.) Vice-versa, having the first kernel wait on user-supplied events means calling clFinish
before VkFFTAppend
also is no longer required.
Separately, it would be good to have an internal cl_event
between kernel launches. (I'm actually not sure what, if anything, is enforcing synchronization between them at the moment, since the clSetKernelArg
calls shouldn't be blocking.) This should allow the host-side work, plus the enqueuing of the subsequent kernel, to occur asynchronously with the previously-enqueued kernel. I'd think removing that overhead would yield a modest (but not negligible, especially for smaller multi-D transforms) performance boost.