llvm [SYCL][Draft] Make some changes for range rounding

This commit adds new clang command line options -fsycl-force-range-rounding as well as -fsycl-exp-range-rounding. Experimental range rounding maps all 1, 2 or 3 dim range kernels to a 1d range kernel. This can give performance improvements when inner dimensions are strangely shaped.

Force range rounding makes only range rounded kernels be generated, which should half the size of device binaries that make use of range rounded kernels. This is because at present, all sycl::range kernels generate two kernels: one with range rounding, and one without. Allowing the user to opt out of creating the unrounded kernel is a good idea IMO.

Feb 12 '24 18:02 hdelan

@intel/dpcpp-clang-driver-reviewers I am not sure about compiler flags. Instead of introducing the new -fsycl-force-range-rounding, should I instead make a new flag -fsycl-range-rounding=[force/disable] which would deprecate the existing -fsycl-disable-range-rounding?

Feb 12 '24 18:02 hdelan

@intel/dpcpp-clang-driver-reviewers I am not sure about compiler flags. Instead of introducing the new -fsycl-force-range-rounding, should I instead make a new flag -fsycl-range-rounding=[force/disable] which would deprecate the existing -fsycl-disable-range-rounding?

I like the idea of introducing a new -fsycl-range-rounding=<arg> option. It alleviates any potential confusion when multiple range-rounding options are used on the command line.

Feb 12 '24 18:02 mdtoguchi

+1 to -fsycl-range-rounding=[disable|on|force]. I'd also split the experimental 1D rounding into a separate PR.

Feb 12 '24 18:02 aelovikov-intel

@mdtoguchi @aelovikov-intel changes for the new -fsycl-range-rounding flag are here. I will keep this PR for the experimental range rounding

Feb 14 '24 17:02 hdelan

I tested this PR with https://gist.github.com/rafbiels/2b584584cfd6412e6b255adab4c264d6 on NVIDIA sm_86 GPU. Here's the outcomes:

   Release     Time [ms]     CUDA gridXYZ     CUDA blockXYZ
-----------------------------------------------------------
  2023.2.1            63     {2, 1888, 1}       {394, 1, 1}
  2024.0.2           223      {788, 3, 1}       {1, 768, 1}
     test1            66     {197, 16, 1}       {4, 118, 1}
     test2            61     {25, 118, 1}       {32, 16, 1}

test1 = hdelan:1d-range-rounding with default flags
test2 = hdelan:1d-range-rounding with -fsycl-exp-range-rounding

The improvement between 2024.0.2 and test1 is thanks to https://github.com/intel/llvm/pull/12663 but we can see that this PR (test2) improves the local range assignment further and the performance surpasses the earlier naive implementation from 2023.2.1.

Mar 22 '24 12:03 rafbiels

Ping @againull @intel/llvm-reviewers-runtime @intel/dpcpp-cfe-reviewers @intel/dpcpp-clang-driver-reviewers @intel/unified-runtime-reviewers

Mar 22 '24 16:03 hdelan

I tested this PR with https://gist.github.com/rafbiels/2b584584cfd6412e6b255adab4c264d6 on NVIDIA sm_86 GPU. Here's the outcomes:

Can we add it into test-e2e/PerformanceTests ?

Mar 22 '24 16:03 aelovikov-intel

I tested this PR with https://gist.github.com/rafbiels/2b584584cfd6412e6b255adab4c264d6 on NVIDIA sm_86 GPU. Here's the outcomes:

Can we add it into test-e2e/PerformanceTests ?

I've added a test to PerformanceTests. I'm not sure if the output needs to be formatted in a certain way or if what I have done is sufficient. Would be great if you could let me know

Mar 22 '24 16:03 hdelan

I'm not sure if the output needs to be formatted in a certain way or if what I have done is sufficient.

The perf tests stuff is in a very early stage, we're basically finding it out as we go. Currently it's just the logs captured for the tests there even in case of a PASS, the rest is left to humans to interpret.

Mar 22 '24 16:03 aelovikov-intel

The perf tests stuff is in a very early stage, we're basically finding it out as we go. Currently it's just the logs captured for the tests there even in case of a PASS, the rest is left to humans to interpret.

OK sure thanks

Mar 22 '24 16:03 hdelan

Actually I just realized there are no FE tests. Please add a test for the predefines.

Test added, thanks.

Mar 22 '24 16:03 hdelan

@intel/llvm-gatekeepers this can be merged now

Apr 01 '24 10:04 hdelan