[SYCL] Improve range reduction performance on CPU
The performance improvement is the result of two complementary changes:
- Using an alternative heuristic to select the work-group size on the CPU. Keeping work-groups small simplifies the combination of partial results and reduces the number of temporary variables.
- Adjusting the mapping of the range to an ND-range. Breaking the range into contiguous chunks that are assigned to each work-item results in streaming access patterns that are better suited to prefetching hardware (a rough sketch follows below).
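For illustration only, the chunked mapping could look roughly like the sketch below. The kernel structure, heuristic constants, and names (WGSize, NumWorkGroups, PerItem) are assumptions made for this example and do not necessarily match the patch.

```cpp
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstddef>

// Illustrative sketch of a chunked range-to-ND-range mapping; names and the
// exact heuristic are placeholders, not the identifiers used in this patch.
// In and Out are assumed to be device-accessible (e.g. USM) allocations.
void chunkedSumReduction(sycl::queue &Q, const float *In, float *Out, size_t N) {
  constexpr size_t WGSize = 16;     // small work-groups on CPU
  const size_t NumWorkGroups = 32;  // picked by some device-specific heuristic
  const size_t NumWorkItems = WGSize * NumWorkGroups;
  const size_t PerItem = (N + NumWorkItems - 1) / NumWorkItems;

  Q.parallel_for(
      sycl::nd_range<1>{NumWorkItems, WGSize},
      sycl::reduction(Out, sycl::plus<float>()),
      [=](sycl::nd_item<1> It, auto &Sum) {
        const size_t Begin = It.get_global_id(0) * PerItem;
        const size_t End = std::min(Begin + PerItem, N);
        // Each work-item streams through one contiguous chunk, which is
        // friendlier to hardware prefetchers than a strided pattern.
        for (size_t I = Begin; I < End; ++I)
          Sum += In[I];
      });
}
```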
Signed-off-by: John Pennycook [email protected]
XPTI makes assumptions about the number of arguments in reduction kernels. Since this patch adds PerGroup as an implicit argument, these assumptions fail. https://github.com/intel/llvm-test-suite/pull/1040 updates these values.
An alternative is to move the calculation of PerGroup into the kernel, which means fewer arguments to the kernel but repeated work for each work-item. @v-klochkov - Thoughts?
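For context, the trade-off between the two options looks roughly like this hypothetical sketch (placeholder names, not the patch's actual code):

```cpp
#include <cstddef>

// Option 1 (this patch): compute PerGroup once on the host and pass it to the
// kernel, at the cost of one extra implicit kernel argument (the one XPTI
// needs to account for).
inline size_t computePerGroupOnHost(size_t NElems, size_t NumWorkGroups) {
  return (NElems + NumWorkGroups - 1) / NumWorkGroups;
}

// Option 2: keep the argument count unchanged and let every work-item repeat
// the same division inside the kernel, e.g. (inside the kernel lambda):
//   size_t PerGroup =
//       (NElems + It.get_group_range(0) - 1) / It.get_group_range(0);
```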
/verify with https://github.com/intel/llvm-test-suite/pull/1040
It seems that the generalization of the max work-group size causes issues with user-defined work-group sizes. I think maybe we have this backwards: instead of limiting the max work-group size, it should set a "preferred" work-group size that will be picked. This solution would move the early exits to reduComputeWGSize rather than reduGetMaxWGSize.
I have made changes to the approach: instead of dictating the maximum work-group size, the configs and the 16 work-group size for CPU now act as a "preferred" work-group size, meaning that when the user hasn't defined a work-group size (i.e. a range reduction rather than an nd_range one) the preferred work-group size is used.
As reduComputeWGSize would be too late to do this, the implementation lives in a new function (reduGetPreferredWGSize), which falls back to reduGetMaxWGSize if no other configuration is selected, avoiding the overhead when it is not needed.
@v-klochkov & @aelovikov-intel - Let me know what you think. If we agree that this is the way to go, I will change the description of the PR.
Your explanation sounds very reasonable, but I'm afraid I'm not familiar enough with the code to make any decisions yet. Letting @v-klochkov answer...
/verify with https://github.com/intel/llvm-test-suite/pull/1040
Personally, I would proceed with the current patch that has the division on the host side, but that opinion is not confirmed by performance benchmarking of 'div on host + 1 argument' vs 'div on device in every work-item'. The first solution is potentially a bit faster on small ranges where PerGroup is small.
Please add a test for the new env variable specifying the preferred group size.
Sorry for the delay. I have added a test verifying that valid configuration values are parsed correctly.
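For reference, the test has roughly this shape (a hypothetical standalone sketch; the actual environment variable name and the runtime's config parser differ):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Stand-in parser for the runtime's config machinery: expects a value such as
// "cpu:16" and returns the requested work-group size.
static std::size_t parsePreferredWGSize(const std::string &Value) {
  const auto Colon = Value.find(':');
  assert(Colon != std::string::npos && "expected <device>:<size>");
  return std::stoul(Value.substr(Colon + 1));
}

int main() {
  // HYPOTHETICAL_PREFERRED_WG_SIZE is a placeholder, not the real variable name.
  setenv("HYPOTHETICAL_PREFERRED_WG_SIZE", "cpu:16", /*overwrite=*/1);
  const char *Raw = std::getenv("HYPOTHETICAL_PREFERRED_WG_SIZE");
  assert(Raw && parsePreferredWGSize(Raw) == 16);
  return 0;
}
```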
/verify with https://github.com/intel/llvm-test-suite/pull/1040