Support non-uniform work-groups in numba-dpex
Currently, numba-dpex does not allow submitting kernels with non-uniform work-groups, i.e., kernels where the local (work-group) size does not evenly divide the global size in every dimension. We need to explore whether this restriction can be removed or at least relaxed.
For reference, ARM has published an OpenCL extension, cl_arm_non_uniform_work_group_size (https://registry.khronos.org/OpenCL/extensions/arm/cl_arm_non_uniform_work_group_size.txt), that may help overcome the issue and is worth a look. It is also worth discussing with the IGC team.
If I may suggest something here: a useful addition would be a keyword that automatically rounds each dimension of the global size up to the nearest multiple of the corresponding dimension of the local size. Currently I do this by hand for each kernel:
global_size = math.ceil(n_work_items / work_group_size) * work_group_size
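To make the suggestion concrete, here is a rough sketch of the per-dimension version of that rounding. The helper name `round_up_global_size` is hypothetical and not part of numba-dpex; it only assumes that the global and local sizes are passed as tuples of positive integers with the same number of dimensions.

```python
def round_up_global_size(global_size, local_size):
    """Round each global dimension up to a multiple of the matching local dimension.

    Hypothetical helper for illustration only; not a numba-dpex API.
    """
    if len(global_size) != len(local_size):
        raise ValueError("global_size and local_size must have the same number of dimensions")
    # -(-g // l) is integer ceiling division. It stays in integer arithmetic,
    # so it avoids the float rounding that math.ceil(g / l) can hit for very large sizes.
    return tuple(-(-g // l) * l for g, l in zip(global_size, local_size))


padded = round_up_global_size((1000, 7), (64, 4))
print(padded)  # (1024, 8)
```

Note that the extra work items introduced by the padding fall outside the original problem size, so the kernel body still needs a bounds check (for example, returning early when the global id is not smaller than `n_work_items`).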