[SYCL][CUDA][HIP] CUDA and HIP devices ignore required subgroup size kernel attribute
Describe the bug
The required subgroup size kernel attribute is ignored on CUDA and HIP devices. When querying the compile subgroup size of a kernel that has the `reqd_sub_group_size` attribute set, 0 is returned.
To reproduce
- Include a code snippet that is as short as possible
```cpp
#include <iostream>
#include <sycl/sycl.hpp>

using namespace sycl;

class Kernel1;

// Change this to a value that is inside sg_sizes
const int SG_SIZE = 32;

int main() {
  queue Q(gpu_selector_v);
  device D = Q.get_device();

  std::vector<size_t> sg_sizes = D.get_info<info::device::sub_group_sizes>();
  std::cout << "Supported subgroup sizes for "
            << D.get_info<info::device::name>() << ": { ";
  for (size_t size : sg_sizes) {
    std::cout << size << " ";
  }
  std::cout << "}" << std::endl;

  Q.submit([&](handler &h) {
     h.parallel_for<Kernel1>(nd_range<1>(512, 128),
                             [=](nd_item<1> ndi)
                                 [[sycl::reqd_sub_group_size(SG_SIZE)]] {});
   }).wait();

  auto KernelID = get_kernel_id<Kernel1>();
  auto Kernel =
      get_kernel_bundle<bundle_state::executable>(Q.get_context(), {KernelID})
          .get_kernel(KernelID);

  std::cout << "Tried to use subgroup size: " << SG_SIZE << std::endl;
  std::cout
      << "Actual used subgroup size: "
      << Kernel.get_info<info::kernel_device_specific::compile_sub_group_size>(
             D)
      << std::endl;
  std::cout << "Done" << std::endl;
  return 0;
}
```
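The mismatch can also be checked from inside the kernel. A minimal sketch (the `Kernel2` name is hypothetical; it assumes USM shared allocations, which the device below reports, and slots into `main()` after the first submit) that records the subgroup size the hardware actually uses at execution time, independent of the `compile_sub_group_size` query:

```cpp
// Sketch: have one work-item write out the subgroup size it actually runs
// with, so the attribute can be verified at execution time as well.
int *ActualSG = malloc_shared<int>(1, Q);
Q.submit([&](handler &h) {
   h.parallel_for<class Kernel2>(
       nd_range<1>(512, 128),
       [=](nd_item<1> ndi) [[sycl::reqd_sub_group_size(SG_SIZE)]] {
         if (ndi.get_global_linear_id() == 0)
           *ActualSG = static_cast<int>(ndi.get_sub_group().get_local_range()[0]);
       });
 }).wait();
std::cout << "Subgroup size observed in-kernel: " << *ActualSG << std::endl;
free(ActualSG, Q);
```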
- Specify the command which should be used to compile the program
```
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda sg-bug.cpp -o sg-bug
```
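Since the same behaviour was observed on HIP, the analogous AMD compile command would presumably be along these lines (the `--offload-arch` value is an assumption and must match the actual GPU):

```
clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa \
        -Xsycl-target-backend --offload-arch=gfx906 \
        sg-bug.cpp -o sg-bug
```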
- Specify the command which should be used to launch the program
```
./sg-bug
```
- Indicate what is wrong and what was expected
We expect the subgroup size set by `reqd_sub_group_size` to be the same as the one reported by `compile_sub_group_size`. Instead, we get 0:
```
Supported subgroup sizes for NVIDIA GeForce RTX 2060: { 32 }
Tried to use subgroup size: 32
Actual used subgroup size: 0
Done
```
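Until the backends honor the attribute (or reject unsupported values), a defensive host-side check can at least surface the mismatch. A minimal sketch, reusing `D`, `Kernel`, and `SG_SIZE` from the reproducer above; it slots into `main()` after the kernel bundle query:

```cpp
// Sketch: fail loudly when the backend did not record the requested size.
auto Compiled =
    Kernel.get_info<info::kernel_device_specific::compile_sub_group_size>(D);
if (Compiled != static_cast<unsigned>(SG_SIZE)) {
  std::cerr << "reqd_sub_group_size(" << SG_SIZE
            << ") not reflected by compile_sub_group_size (got " << Compiled
            << ")" << std::endl;
  return 1;
}
```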
Environment
- OS: Linux
- Target device and vendor: NVIDIA GPU (similar behaviour was observed on HIP as well)
- Dependencies version:
```
Platform [#3]:
    Version : CUDA 12.4
    Name    : NVIDIA CUDA BACKEND
    Vendor  : NVIDIA Corporation
    Devices : 1
    Device [#0]:
        Type              : gpu
        Version           : 7.5
        Name              : NVIDIA GeForce RTX 2060
        Vendor            : NVIDIA Corporation
        Driver            : CUDA 12.4
        UUID              : 862229814554125179692371413399212160130147
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag
        Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_75
```
Additional context
The SubGroup/attributes.cpp e2e test is currently marked as unsupported on CUDA/HIP because of this behaviour.
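In LIT-based e2e tests that marking is typically a one-line comment near the top of the test file; presumably something along these lines:

```cpp
// Hypothetical illustration of a LIT feature guard; the exact line in
// SubGroup/attributes.cpp may differ.
// UNSUPPORTED: cuda || hip
```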