[SYCL] Improve range reduction performance on CPU
The performance improvement is the result of two complementary changes:
- Using an alternative heuristic to select the work-group size on the CPU. Keeping work-groups small simplifies the combination of partial results and reduces the number of temporary variables.
- Adjusting the mapping of the range to an ND-range. Breaking the range into contiguous chunks that are assigned to each work-item results in streaming access patterns that are better suited to prefetching hardware (a rough sketch follows below).
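For illustration only, the chunked mapping could look roughly like the sketch below. The kernel structure, heuristic constants, and names (WGSize, NumWorkGroups, PerItem) are assumptions made for this example and do not necessarily match the patch.

```cpp
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstddef>

// Illustrative sketch of a chunked range-to-ND-range mapping; names and the
// exact heuristic are placeholders, not the identifiers used in this patch.
// In and Out are assumed to be device-accessible (e.g. USM) allocations.
void chunkedSumReduction(sycl::queue &Q, const float *In, float *Out, size_t N) {
  constexpr size_t WGSize = 16;     // small work-groups on CPU
  const size_t NumWorkGroups = 32;  // picked by some device-specific heuristic
  const size_t NumWorkItems = WGSize * NumWorkGroups;
  const size_t PerItem = (N + NumWorkItems - 1) / NumWorkItems;

  Q.parallel_for(
      sycl::nd_range<1>{NumWorkItems, WGSize},
      sycl::reduction(Out, sycl::plus<float>()),
      [=](sycl::nd_item<1> It, auto &Sum) {
        const size_t Begin = It.get_global_id(0) * PerItem;
        const size_t End = std::min(Begin + PerItem, N);
        // Each work-item streams through one contiguous chunk, which is
        // friendlier to hardware prefetchers than a strided pattern.
        for (size_t I = Begin; I < End; ++I)
          Sum += In[I];
      });
}
```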
Signed-off-by: John Pennycook [email protected]
XPTI makes assumptions about the number of arguments in reduction kernels. Since this patch adds PerGroup as an implicit argument, these assumptions fail. https://github.com/intel/llvm-test-suite/pull/1040 updates these values.
An alternative is to move the calculation of PerGroup into the kernel, which means fewer arguments to the kernel but repeated work for each work-item. @v-klochkov - Thoughts?
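For context, the trade-off between the two options looks roughly like this hypothetical sketch (placeholder names, not the patch's actual code):

```cpp
#include <cstddef>

// Option 1 (this patch): compute PerGroup once on the host and pass it to the
// kernel, at the cost of one extra implicit kernel argument (the one XPTI
// needs to account for).
inline size_t computePerGroupOnHost(size_t NElems, size_t NumWorkGroups) {
  return (NElems + NumWorkGroups - 1) / NumWorkGroups;
}

// Option 2: keep the argument count unchanged and let every work-item repeat
// the same division inside the kernel, e.g. (inside the kernel lambda):
//   size_t PerGroup =
//       (NElems + It.get_group_range(0) - 1) / It.get_group_range(0);
```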
/verify with https://github.com/intel/llvm-test-suite/pull/1040
It seems that the generalization of the max work-group size causes issues with user-defined work-group sizes. I think maybe we have this backwards: instead of limiting the max work-group size, it should set a "preferred" work-group size that will be picked. This solution would move the early exits to reduComputeWGSize rather than reduGetMaxWGSize.
I have made changes to the approach: instead of dictating the maximum work-group size, the configs and the 16 work-group size for CPU now act as a "preferred" work-group size, meaning that when the user hasn't defined a work-group size (i.e. a range reduction rather than an nd_range one) the preferred work-group size is used.
As reduComputeWGSize would be too late to do this, the implementation lives in a new function (reduGetPreferredWGSize), which falls back to reduGetMaxWGSize if no other configuration is selected, avoiding the overhead when it is not needed.
@v-klochkov & @aelovikov-intel - Let me know what you think. If we agree that this is the way to go, I will change the description of the PR.
Your explanation sounds very reasonable, but I'm afraid I'm not familiar enough with the code to make any decisions yet. Letting @v-klochkov answer...
/verify with https://github.com/intel/llvm-test-suite/pull/1040
Personally, I would proceed with the current patch that has the division on the host side, but that opinion is not confirmed by performance benchmarking of 'div on host + 1 argument' vs 'div on device in every work-item'. The first solution is potentially a bit faster on small ranges where PerGroup is small.
Please add a test for the new env variable specifying the preferred group size.
Sorry for the delay. I have added a test verifying that valid configuration values are parsed correctly.
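For reference, the test has roughly this shape (a hypothetical standalone sketch; the actual environment variable name and the runtime's config parser differ):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Stand-in parser for the runtime's config machinery: expects a value such as
// "cpu:16" and returns the requested work-group size.
static std::size_t parsePreferredWGSize(const std::string &Value) {
  const auto Colon = Value.find(':');
  assert(Colon != std::string::npos && "expected <device>:<size>");
  return std::stoul(Value.substr(Colon + 1));
}

int main() {
  // HYPOTHETICAL_PREFERRED_WG_SIZE is a placeholder, not the real variable name.
  setenv("HYPOTHETICAL_PREFERRED_WG_SIZE", "cpu:16", /*overwrite=*/1);
  const char *Raw = std::getenv("HYPOTHETICAL_PREFERRED_WG_SIZE");
  assert(Raw && parsePreferredWGSize(Raw) == 16);
  return 0;
}
```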
/verify with https://github.com/intel/llvm-test-suite/pull/1040