[SYCL][CUDA][HIP] CUDA and HIP devices ignore required subgroup size kernel attribute
Describe the bug
The required subgroup size kernel attribute is ignored on CUDA and HIP devices. When querying the compile subgroup size of a kernel that has the `reqd_sub_group_size` attribute set, 0 is returned.
To reproduce
- Include a code snippet that is as short as possible
```cpp
#include <iostream>
#include <sycl/sycl.hpp>

using namespace sycl;

class Kernel1;

// Change this to a value that is inside sg_sizes
const int SG_SIZE = 32;

int main() {
  queue Q(gpu_selector_v);
  device D = Q.get_device();

  std::vector<size_t> sg_sizes = D.get_info<info::device::sub_group_sizes>();
  std::cout << "Supported subgroup sizes for "
            << D.get_info<info::device::name>() << ": { ";
  for (size_t size : sg_sizes) {
    std::cout << size << " ";
  }
  std::cout << "}" << std::endl;

  Q.submit([&](handler &h) {
     h.parallel_for<Kernel1>(nd_range<1>(512, 128),
                             [=](nd_item<1> ndi)
                                 [[sycl::reqd_sub_group_size(SG_SIZE)]] {});
   }).wait();

  auto KernelID = get_kernel_id<Kernel1>();
  auto Kernel =
      get_kernel_bundle<bundle_state::executable>(Q.get_context(), {KernelID})
          .get_kernel(KernelID);

  std::cout << "Tried to use subgroup size: " << SG_SIZE << std::endl;
  std::cout
      << "Actual used subgroup size: "
      << Kernel.get_info<info::kernel_device_specific::compile_sub_group_size>(
             D)
      << std::endl;
  std::cout << "Done" << std::endl;
  return 0;
}
```
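The mismatch can also be checked from inside the kernel. A minimal sketch (the `Kernel2` name is hypothetical; it assumes USM shared allocations, which the device below reports, and slots into `main()` after the first submit) that records the subgroup size the hardware actually uses at execution time, independent of the `compile_sub_group_size` query:

```cpp
// Sketch: have one work-item write out the subgroup size it actually runs
// with, so the attribute can be verified at execution time as well.
int *ActualSG = malloc_shared<int>(1, Q);
Q.submit([&](handler &h) {
   h.parallel_for<class Kernel2>(
       nd_range<1>(512, 128),
       [=](nd_item<1> ndi) [[sycl::reqd_sub_group_size(SG_SIZE)]] {
         if (ndi.get_global_linear_id() == 0)
           *ActualSG = static_cast<int>(ndi.get_sub_group().get_local_range()[0]);
       });
 }).wait();
std::cout << "Subgroup size observed in-kernel: " << *ActualSG << std::endl;
free(ActualSG, Q);
```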
- Specify the command which should be used to compile the program
```
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda sg-bug.cpp -o sg-bug
```
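Since the same behaviour was observed on HIP, the analogous AMD compile command would presumably be along these lines (the `--offload-arch` value is an assumption and must match the actual GPU):

```
clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa \
        -Xsycl-target-backend --offload-arch=gfx906 \
        sg-bug.cpp -o sg-bug
```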
- Specify the command which should be used to launch the program
```
./sg-bug
```
- Indicate what is wrong and what was expected
We expect the subgroup size set by `reqd_sub_group_size` to be the same as the one reported by `compile_sub_group_size`. Instead, we get 0:
```
Supported subgroup sizes for NVIDIA GeForce RTX 2060: { 32 }
Tried to use subgroup size: 32
Actual used subgroup size: 0
Done
```
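Until the backends honor the attribute (or reject unsupported values), a defensive host-side check can at least surface the mismatch. A minimal sketch, reusing `D`, `Kernel`, and `SG_SIZE` from the reproducer above; it slots into `main()` after the kernel bundle query:

```cpp
// Sketch: fail loudly when the backend did not record the requested size.
auto Compiled =
    Kernel.get_info<info::kernel_device_specific::compile_sub_group_size>(D);
if (Compiled != static_cast<unsigned>(SG_SIZE)) {
  std::cerr << "reqd_sub_group_size(" << SG_SIZE
            << ") not reflected by compile_sub_group_size (got " << Compiled
            << ")" << std::endl;
  return 1;
}
```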
Environment
- OS: Linux
- Target device and vendor: NVIDIA GPU (similar behaviour was observed on HIP as well)
- Dependencies version:
```
Platform [#3]:
    Version : CUDA 12.4
    Name    : NVIDIA CUDA BACKEND
    Vendor  : NVIDIA Corporation
    Devices : 1
    Device [#0]:
        Type              : gpu
        Version           : 7.5
        Name              : NVIDIA GeForce RTX 2060
        Vendor            : NVIDIA Corporation
        Driver            : CUDA 12.4
        UUID              : 862229814554125179692371413399212160130147
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag
        Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_75
```
Additional context
The SubGroup/attributes.cpp e2e test is currently marked as unsupported on CUDA/HIP because of this behaviour.
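In LIT-based e2e tests that marking is typically a one-line comment near the top of the test file; presumably something along these lines:

```cpp
// Hypothetical illustration of a LIT feature guard; the exact line in
// SubGroup/attributes.cpp may differ.
// UNSUPPORTED: cuda || hip
```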