
`opencl-aot` fails to compile SYCL kernels with an unsupported subgroup size

Open · fwyzard opened this issue 2 years ago · 6 comments

Describe the bug

When compiling a SYCL/oneAPI application ahead of time for Intel CPUs, the current version of opencl-aot (2023.2.0) fails to compile a kernel that uses a subgroup size that is not supported by the OpenCL runtime.

According to the SYCL specification, all SYCL implementations must be able to compile device code that uses these optional features (various subgroup sizes, etc.) regardless of whether the implementation supports them on any of its devices.

To Reproduce

Please describe the steps to reproduce the behavior:

1. Include a code snippet that is as short as possible:

subgroup_test.cc

#include <cstdio>
#include <iostream>

#include <sycl/sycl.hpp>


#ifdef __SYCL_DEVICE_ONLY__
#    define __DEVICE_CONSTANT__ [[clang::opencl_constant]]
#else
#    define __DEVICE_CONSTANT__
#endif

#define printf(FORMAT, ...)                                                                                           \
    do                                                                                                                \
    {                                                                                                                 \
        static const char* __DEVICE_CONSTANT__ format = FORMAT;                                                       \
        sycl::ext::oneapi::experimental::printf(format, ##__VA_ARGS__);                                               \
    } while(false)


template <uint32_t S>
struct do_some_work {
  void operator()(sycl::nd_item<1> item) const {
    printf("      the expected sub-group size is %d\n", S);
    printf("      the actual sub-group size is %d\n", item.get_sub_group().get_max_local_range()[0]);
  }
};


int main() {
  auto platforms = sycl::platform::get_platforms();

  for (auto const& platform : platforms) {
    std::cout << "SYCL platform: " << platform.get_info<sycl::info::platform::name>() << '\n';
    auto devices = platform.get_devices();

    for (auto const& device : devices) {
      sycl::queue queue{device};

      auto sizes = device.get_info<sycl::info::device::sub_group_sizes>();
      std::cout << "  sub-group sizes supported by the device: " << sizes[0];
      for (std::size_t i = 1; i < sizes.size(); ++i) {
        std::cout << ", " << sizes[i];
      }
      std::cout << '\n';

      for (int size : sizes) {
        std::cout << "\n    test sub-group of " << size << " elements:\n";

        // check if the kernel should be launched with a subgroup size of 4
        if (size == 4) {
          // launch the kernel with a subgroup size of 4
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1),
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { do_some_work<4>{}(item); });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 8
        if (size == 8) {
          // launch the kernel with a subgroup size of 8
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1),
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(8)]] { do_some_work<8>{}(item); });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 16
        if (size == 16) {
          // launch the kernel with a subgroup size of 16
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
              do_some_work<16>{}(item);
            });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 32
        if (size == 32) {
          // launch the kernel with a subgroup size of 32
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(32)]] {
              do_some_work<32>{}(item);
            });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 64
        if (size == 64) {
          // launch the kernel with a subgroup size of 64
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(64)]] {
              do_some_work<64>{}(item);
            });
          }).wait();
        }
 
        // check if the kernel should be launched with a subgroup size of 128
        if (size == 128) {
          // launch the kernel with a subgroup size of 128
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
              do_some_work<128>{}(item);
            });
          }).wait();
        }
      }
    }
    std::cout << '\n';
  }
  std::cout << '\n';
}

2. Specify the command which should be used to compile the program

$ source /opt/intel/oneapi/setvars.sh
$ icpx -std=c++17 -O2 -g -Wall -fsycl -fsycl-targets=spir64_x86_64 subgroup_test.cc -o test.cpu

3. Specify the command which should be used to launch the program

ONEAPI_DEVICE_SELECTOR='opencl:cpu' ./test.cpu

4. Indicate what is wrong and what was expected

The program fails to compile, with the error

Failed to build: : -11 (CL_BUILD_PROGRAM_FAILURE)

llvm-foreach: 
icpx: error: x86_64 compiler command failed with exit code 245 (use -v to see invocation)
Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230622)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm/../bin/icpx.cfg
icpx: note: diagnostic msg: Error generating preprocessed source(s).

The expected behaviour is that the program should compile correctly, building the kernel for all the supported subgroup sizes (4, 8, 16, 32, 64) and possibly issuing a warning about the unsupported subgroup size (128).

For completeness, Codeplay's NVIDIA plugin produces only a warning about the unsupported subgroup sizes, and builds the kernel correctly for the supported one:

$ icpx -std=c++17 -O2 -g -Wall -Wno-unknown-cuda-version -fsycl -fsycl-targets=nvidia_gpu_sm_86 subgroup_test.cc -o test.nv
subgroup_test.cc:56:86: warning: attribute argument 4 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { do_some_work<4>{}(item); });
                                                                                     ^
subgroup_test.cc:65:86: warning: attribute argument 8 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(8)]] { do_some_work<8>{}(item); });
                                                                                     ^
subgroup_test.cc:73:111: warning: attribute argument 16 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
                                                                                                              ^
subgroup_test.cc:93:111: warning: attribute argument 64 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(64)]] {
                                                                                                              ^
subgroup_test.cc:103:111: warning: attribute argument 128 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
                                                                                                              ^
5 warnings generated.

$ ONEAPI_DEVICE_SELECTOR='cuda:gpu' ./test.nv 
SYCL platform: NVIDIA CUDA BACKEND
  sub-group sizes supported by the device: 32

    test sub-group of 32 elements:
      the expected sub-group size is 32
      the actual sub-group size is 32

Environment (please complete the following information):

  • OS: Linux (tested on Ubuntu 22.04 and RHEL 8.7)
  • Target device and vendor: Intel CPU
  • DPC++ version: 2023.1.0 and 2023.2.0
  • Dependencies version: n/a

Additional context

According to the latest SYCL 2020 specification:

5.7. Optional kernel features

A number of kernel features defined by this SYCL specification are optional; they may be supported on some devices but not on other devices. As described in Section 4.6.4.3, an application can test whether a device supports these features by testing whether the device has an associated aspect. The following aspects are those that correspond to optional kernel features:

  • fp16
  • fp64
  • atomic64

In addition, the following C++ attributes from Section 5.8.1 also correspond to optional kernel features because they force the kernel to be compiled in a way that might not run on all devices:

  • reqd_work_group_size()
  • reqd_sub_group_size()

In order to guarantee source code portability of SYCL applications that use optional kernel features, all SYCL implementations must be able to compile device code that uses these optional features regardless of whether the implementation supports the features on any of its devices.

(emphasis added)
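
As a minimal illustration of the aspect test mentioned in the quoted text, using only standard SYCL 2020 API (the fp64 computation is just a placeholder):

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;

  // Test for the optional feature at run time before submitting a kernel
  // that depends on it; per section 5.7, the kernel must still compile on
  // every implementation, even when no device supports the aspect.
  if (q.get_device().has(sycl::aspect::fp64)) {
    q.single_task([] {
      double x = 1.0 / 3.0;  // fp64 arithmetic, an optional kernel feature
      (void)x;
    }).wait();
  }
}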

Note: I would rate this issue as low priority, because the OpenCL CPU runtime supports a wider range of subgroup sizes (4, 8, 16, 32, 64) than any other SYCL backend. So, while the AOT compiler does not follow the SYCL specification, this specific issue is unlikely to cause any real-world problems, as hardly anyone will use subgroup sizes smaller than 4 or larger than 64.

fwyzard · Jul 22 '23 12:07

@auroraperego FYI

fwyzard · Jul 22 '23 13:07

@igorvorobtsov FYI

fwyzard · Jul 22 '23 13:07

Hi @fwyzard, thanks for the report.

Lack of proper support for optional kernel features in AOT mode is a known limitation of our toolchain, but it is unlikely to be addressed this year, because it depends on a couple of other mechanisms and interferes with some ongoing refactoring of the toolchain internals.

Some technical background here:

Two main things need to be supported in order to implement the request:

  • ability to specify which exact target we should compile for
  • ability to understand which optional features are supported by a target at compile-time

The first bullet is already partially implemented: -fsycl-targets supports special values that specify the exact device (see the example invocation after the list below), but it has limitations:

Special target values specific to Intel, NVIDIA and AMD Processor Graphics support are accepted, providing a streamlined interface for AOT. Only one of these values at a time is supported.

  • intel_gpu_pvc - Ponte Vecchio Intel graphics architecture
  • intel_gpu_acm_g12 - Alchemist G12 Intel graphics architecture
  • intel_gpu_acm_g11 - Alchemist G11 Intel graphics architecture
  • ...
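
For example, a single device-specific AOT target can be selected like this (a hypothetical invocation, following the same pattern as the spir64_x86_64 command above):

$ icpx -std=c++17 -O2 -g -Wall -fsycl -fsycl-targets=intel_gpu_pvc subgroup_test.cc -o test.pvc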

A better implementation, which supports multiple special targets, is being designed in #8658.

The second bullet will be fulfilled by the recently added so-called device config file: #9371, #9846. That work should be expanded further into a full, compile-time-known database of supported optional features per architecture (a hypothetical sketch follows below). It should then be connected to the special-targets support, with some extra logic in the toolchain to conditionally invoke the AOT compiler.
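
Purely as an illustration of the idea (this is not the actual DPC++ implementation or device config file format), such a database could be as simple as a compile-time table keyed by target, seeded here with the sub-group sizes discussed in this issue:

#include <array>
#include <cstdint>
#include <string_view>

// Hypothetical per-target database of supported optional kernel features.
struct TargetFeatures {
  std::string_view target;                       // -fsycl-targets value
  std::array<std::uint32_t, 5> sub_group_sizes;  // supported sizes (0 = unused slot)
};

inline constexpr TargetFeatures known_targets[] = {
    {"spir64_x86_64", {4, 8, 16, 32, 64}},   // Intel CPU OpenCL runtime
    {"nvidia_gpu_sm_86", {32, 0, 0, 0, 0}},  // CUDA backend: fixed size 32
};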

AlexeySachkov · Jul 24 '23 18:07

Hi @AlexeySachkov, thanks for the detailed explanation.

The workaround that we are implementing is to use the new device-specific targets to know the actual device at compile time, and to use preprocessor checks to make sure we compile a kernel only if its subgroup size is supported (see https://github.com/alpaka-group/alpaka/pull/1845).
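
For reference, a minimal sketch of that approach applied to the 128-wide kernel from the reproducer above; the target macro name (__SYCL_TARGET_INTEL_X86_64__) is an assumption here, standing in for whatever macro the compiler defines for the CPU device-specific target:

// Compile the 128-wide kernel only when not targeting Intel CPUs, whose
// OpenCL runtime supports sub-group sizes up to 64; the guard removes the
// kernel from the translation unit, so the AOT compiler never sees it.
#if !defined(__SYCL_TARGET_INTEL_X86_64__)  // assumed target macro
        if (size == 128) {
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
              do_some_work<128>{}(item);
            });
          }).wait();
        }
#endif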

The work on the device config files definitely looks interesting!

fwyzard · Jul 26 '23 12:07

We are making progress towards support for optional kernel features in AOT mode. In particular, this issue should be resolved by #14590. We don't yet have specific targets for each of our CPUs, but we have registered a list of known supported sub-group sizes for the generic spir64_x86_64 target.

Your example application (modified, but recognizable) was added as a test case to our test suite here: sycl/test-e2e/AOT/reqd-sg-size.cpp.

I will keep this issue open for now in case there are further questions or feedback.

AlexeySachkov · Jul 25 '24 08:07

Thank you @AlexeySachkov. Is the list of supported subgroup sizes 4, 8, 16, 32, 64? Given the timeline, I assume this is not in oneAPI 2024.2.1, right?

fwyzard · Aug 23 '24 08:08

Is the list of supported subgroup sizes 4, 8, 16, 32, 64?

Yes

Given the timeline, I assume this is not in oneAPI 2024.2.1, right?

Correct, it will only be part of the next major release/update and won't be included in hotfix update releases.

AlexeySachkov · Sep 03 '24 07:09

OK, thanks.

fwyzard · Sep 03 '24 07:09