[SYCL][Draft] Make some changes for range rounding
This commit adds new clang command line options -fsycl-force-range-rounding as well as -fsycl-exp-range-rounding. Experimental range rounding maps all 1, 2 or 3 dim range kernels to a 1d range kernel. This can give performance improvements when inner dimensions are strangely shaped.
Force range rounding makes only range rounded kernels be generated, which should half the size of device binaries that make use of range rounded kernels. This is because at present, all sycl::range kernels generate two kernels: one with range rounding, and one without. Allowing the user to opt out of creating the unrounded kernel is a good idea IMO.
@intel/dpcpp-clang-driver-reviewers I am not sure about compiler flags. Instead of introducing the new -fsycl-force-range-rounding, should I instead make a new flag -fsycl-range-rounding=[force/disable] which would deprecate the existing -fsycl-disable-range-rounding?
@intel/dpcpp-clang-driver-reviewers I am not sure about compiler flags. Instead of introducing the new
-fsycl-force-range-rounding, should I instead make a new flag-fsycl-range-rounding=[force/disable]which would deprecate the existing-fsycl-disable-range-rounding?
I like the idea of introducing a new -fsycl-range-rounding=<arg> option. It alleviates any potential confusion when multiple range-rounding options are used on the command line.
+1 to -fsycl-range-rounding=[disable|on|force]. I'd also split the experimental 1D rounding into a separate PR.
@mdtoguchi @aelovikov-intel changes for the new -fsycl-range-rounding flag are here. I will keep this PR for the experimental range rounding
I tested this PR with https://gist.github.com/rafbiels/2b584584cfd6412e6b255adab4c264d6 on NVIDIA sm_86 GPU. Here's the outcomes:
Release Time [ms] CUDA gridXYZ CUDA blockXYZ
-----------------------------------------------------------
2023.2.1 63 {2, 1888, 1} {394, 1, 1}
2024.0.2 223 {788, 3, 1} {1, 768, 1}
test1 66 {197, 16, 1} {4, 118, 1}
test2 61 {25, 118, 1} {32, 16, 1}
test1 = hdelan:1d-range-rounding with default flags
test2 = hdelan:1d-range-rounding with -fsycl-exp-range-rounding
The improvement between 2024.0.2 and test1 is thanks to https://github.com/intel/llvm/pull/12663 but we can see that this PR (test2) improves the local range assignment further and the performance surpasses the earlier naive implementation from 2023.2.1.
Ping @againull @intel/llvm-reviewers-runtime @intel/dpcpp-cfe-reviewers @intel/dpcpp-clang-driver-reviewers @intel/unified-runtime-reviewers
I tested this PR with https://gist.github.com/rafbiels/2b584584cfd6412e6b255adab4c264d6 on NVIDIA sm_86 GPU. Here's the outcomes:
Can we add it into test-e2e/PerformanceTests ?
I tested this PR with https://gist.github.com/rafbiels/2b584584cfd6412e6b255adab4c264d6 on NVIDIA sm_86 GPU. Here's the outcomes:
Can we add it into
test-e2e/PerformanceTests?
I've added a test to PerformanceTests. I'm not sure if the output needs to be formatted in a certain way or if what I have done is sufficient. Would be great if you could let me know
I'm not sure if the output needs to be formatted in a certain way or if what I have done is sufficient.
The perf tests stuff is in a very early stage, we're basically finding it out as we go. Currently it's just the logs captured for the tests there even in case of a PASS, the rest is left to humans to interpret.
The perf tests stuff is in a very early stage, we're basically finding it out as we go. Currently it's just the logs captured for the tests there even in case of a PASS, the rest is left to humans to interpret.
OK sure thanks
Actually I just realized there are no FE tests. Please add a test for the predefines.
Test added, thanks.
@intel/llvm-gatekeepers this can be merged now