VkFFT icon indicating copy to clipboard operation
VkFFT copied to clipboard

Find a way to set OpenCL events

Open yves-surrel opened this issue 3 years ago • 3 comments

In the recent OpenCL backend, the kernel is enqueued with:

result = clEnqueueNDRangeKernel(app->configuration.commandQueue[0], axis->kernel, 3, 0, global_work_size, local_work_size, 0, 0, 0);

i.e. the last three arguments are set to 0. It would be nice to set/have access to these event arguments, which are helpful for synchronization and profiling.

yves-surrel avatar Jun 06 '21 08:06 yves-surrel

@DTolm, I was thinking this might be handled best by adding num_events_in_wait_list, event_wait_list, and event members to VkFFTLaunchParams, following the API. The first two denote the events that the first kernel should wait on (i.e., should only be passed at the first kernel invocation). The last one should be passed to the final kernel invocation so that the user can query its status after that last kernel asynchronously returns.

Along these lines, I think it would be best practice to propagate events between each kernel invocation (within a single transform) to ensure synchronization. I'm not familiar with the internals of VkFFT, but one could imagine an out-of-order queue launching a kernel too early.

If this sounds appropriate to you and it would be helpful, I would be willing to work on a PR. Should be pretty simple.

zachjweiner avatar Nov 28 '21 13:11 zachjweiner

@zachjweiner I don't have much experience with OpenCL events, but they should be similar to what Vulkan has with synchronization when commandBuffer is created, but for kernel dispatches. You can do this, or wait until I finish another code reorganization and split it into multiple headers, it should be easier to understand then.

DTolm avatar Dec 02 '21 20:12 DTolm

Taking a closer look, I see now that dispatchEnhanced isn't passed launchParams. I suppose the "right" thing to do (based on a cursory look at the general internal workflow of VkFFTAppend) would be to set

app->configuration.num_events_in_wait_list = launchParams->num_events_in_wait_list;
app->configuration.event_wait_list = launchParams->event_wait_list;
app->configuration.event = launchParams->event;

(and as usual ensure these are zero unless the user sets them). On the first kernel invocation do the equivalent of

result = clEnqueueNDRangeKernel(app->configuration.commandQueue[0], axis->kernel, 3, 0, global_work_size, local_work_size, app->configuration.num_events_in_wait_list, app->configuration.event_wait_list, 0);

and, on the last,

result = clEnqueueNDRangeKernel(app->configuration.commandQueue[0], axis->kernel, 3, 0, global_work_size, local_work_size, 0, 0, app->configuration.event);

Since VkFFTSync is a no-op on the OpenCL backend (which is good), VkFFTAppend should release asynchronously with the execution of the final kernel, letting the user wait on their launchParams->event. (Right now, one technically has to call clFinish to ensure that kernel finishes before they use the buffers.) Vice-versa, having the first kernel wait on user-supplied events means calling clFinish before VkFFTAppend also is no longer required.

Separately, it would be good to have an internal cl_event between kernel launches. (I'm actually not sure what, if anything, is enforcing synchronization between them at the moment, since the clSetKernelArg calls shouldn't be blocking.) This should allow the host-side work, plus the enqueuing of the subsequent kernel, to occur asynchronously with the previously-enqueued kernel. I'd think removing that overhead would yield a modest (but not negligible, especially for smaller multi-D transforms) performance boost.

zachjweiner avatar Dec 02 '21 22:12 zachjweiner