compute-runtime `ocloc` fails to compile SYCL kernels with an unsupported subgroup size

According to the latest SYCL 2020 specification:

5.7. Optional kernel features

A number of kernel features defined by this SYCL specification are optional; they may be supported on some devices but not on other devices. As described in Section 4.6.4.3, an application can test whether a device supports these features by testing whether the device has an associated aspect. The following aspects are those that correspond to optional kernel features:

fp16

fp64

atomic64

In addition, the following C++ attributes from Section 5.8.1 also correspond to optional kernel features because they force the kernel to be compiled in a way that might not run on all devices:

reqd_work_group_size()

reqd_sub_group_size()

In order to guarantee source code portability of SYCL applications that use optional kernel features, all SYCL implementations must be able to compile device code that uses these optional features regardless of whether the implementation supports the features on any of its devices.

(emphasis added)

When compiling a SYCL/oneAPI application ahead of time for Intel GPUs, the current version of ocloc (23.17.26241.33) fails to compile a kernel that uses a subgroup size that is not supported by the target GPU:

$ icpx -std=c++17 -O2 -g -Wall -fsycl -fsycl-targets=intel_gpu_pvc subgroup_test.cc -o test.gpu
...
error: Unsupported required sub group size
in kernel: 'typeinfo name for main::'lambda'(sycl::_V1::handler&)::operator()(sycl::_V1::handler&) const::'lambda'(sycl::_V1::nd_item<1>)'
error: backend compiler failed build.

Build failed with error code: -11
Command was: /usr/bin/ocloc -output /tmp/subgroup_test-pvc-6e1f5e.out -file /tmp/icpx-354b22/subgroup_test-pvc-d8ca68.spv -output_no_suffix -spirv_input -device pvc -options "-g"
llvm-foreach: 
icpx: error: gen compiler command failed with exit code 245 (use -v to see invocation)
Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/../bin/icpx.cfg
icpx: note: diagnostic msg: Error generating preprocessed source(s).

On the contrary, CodePlay's NVIDIA plugin produces only a warning about unsupported subgroup sizes, and builds the kernel correctly for the supported one:

$ icpx -std=c++17 -O2 -g -Wall -Wno-unknown-cuda-version -fsycl -fsycl-targets=nvidia_gpu_sm_86 subgroup_test.cc -o test.nv
subgroup_test.cc:56:86: warning: attribute argument 4 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { do_some_work<4>{}(item); });
                                                                                     ^
subgroup_test.cc:65:86: warning: attribute argument 8 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(8)]] { do_some_work<8>{}(item); });
                                                                                     ^
subgroup_test.cc:73:111: warning: attribute argument 16 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
                                                                                                              ^
subgroup_test.cc:93:111: warning: attribute argument 64 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(64)]] {
                                                                                                              ^
subgroup_test.cc:103:111: warning: attribute argument 128 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
                                                                                                              ^
5 warnings generated.

$ ONEAPI_DEVICE_SELECTOR='cuda:gpu' ./test.nv 
SYCL platform: NVIDIA CUDA BACKEND
  sub-group sizes supported by the device: 32

    test sub-group of 32 elements:
      the expected sub-group size is 32
      the actual sub-group size is 32

A test file to reproduce the issue is attached, and also reproduced here:

`subgroup_test.cc`

#include <cstdio>
#include <iostream>

#include <sycl/sycl.hpp>


#ifdef __SYCL_DEVICE_ONLY__
#    define __DEVICE_CONSTANT__ [[clang::opencl_constant]]
#else
#    define __DEVICE_CONSTANT__
#endif

#define printf(FORMAT, ...)                                                                                           \
    do                                                                                                                \
    {                                                                                                                 \
        static const char* __DEVICE_CONSTANT__ format = FORMAT;                                                       \
        sycl::ext::oneapi::experimental::printf(format, ##__VA_ARGS__);                                               \
    } while(false)


template <uint32_t S>
struct do_some_work {
  void operator()(sycl::nd_item<1> item) const {
    printf("      the expected sub-group size is %d\n", S);
    printf("      the actual sub-group size is %d\n", item.get_sub_group().get_max_local_range()[0]);
  }
};


int main() {
  auto platforms = sycl::platform::get_platforms();

  for (auto const& platform : platforms) {
    std::cout << "SYCL platform: " << platform.get_info<sycl::info::platform::name>() << '\n';
    auto devices = platform.get_devices();

    for (auto const& device : devices) {
      sycl::queue queue{device};

      auto sizes = device.get_info<sycl::info::device::sub_group_sizes>();
      std::cout << "  sub-group sizes supported by the device: " << sizes[0];
      for (int i = 1; i < sizes.size(); ++i) {
        std::cout << ", " << sizes[i];
      }
      std::cout << '\n';

      auto range = sycl::nd_range<1>(1, 1);
      for (int size : sizes) {
        std::cout << "\n    test sub-group of " << size << " elements:\n";

        // check if the kernel should be launched with a subgroup size of 4
        if (size == 4) {
          // launch the kernel with a subgroup size of 4
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1),
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { do_some_work<4>{}(item); });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 8
        if (size == 8) {
          // launch the kernel with a subgroup size of 8
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1),
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(8)]] { do_some_work<8>{}(item); });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 16
        if (size == 16) {
          // launch the kernel with a subgroup size of 16
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
              do_some_work<16>{}(item);
            });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 32
        if (size == 32) {
          // launch the kernel with a subgroup size of 32
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(32)]] {
              do_some_work<32>{}(item);
            });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 64
        if (size == 64) {
          // launch the kernel with a subgroup size of 64
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(64)]] {
              do_some_work<64>{}(item);
            });
          }).wait();
        }
 
        // check if the kernel should be launched with a subgroup size of 128
        if (size == 128) {
          // launch the kernel with a subgroup size of 128
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
              do_some_work<128>{}(item);
            });
          }).wait();
        }
      }
    }
    std::cout << '\n';
  }
  std::cout << '\n';
}

Jul 22 '23 10:07 fwyzard

@auroraperego FYI

Jul 22 '23 13:07 fwyzard

@igorvorobtsov FYI

Jul 22 '23 13:07 fwyzard

Hi Andrea, I believe this should be implemented on the SYCL runtime side and not on compute runtime (or probably both). I will escalate this.

Jul 24 '23 11:07 igorvorobtsov

This is indeed a SYCL toolchain problem and I think that this issue can be closed

Jul 25 '24 08:07 AlexeySachkov

compute-runtime compute-runtime copied to clipboard

`ocloc` fails to compile SYCL kernels with an unsupported subgroup size

5.7. Optional kernel features

subgroup_test.cc

compute-runtime
compute-runtime copied to clipboard

`subgroup_test.cc`