SYCL Shows Lower Performance Compared to CUDA in Double Precision FFT benchmark sample code.

Open yencal opened this issue 1 year ago • 0 comments

Describe the bug

The double precision performance of fft-sycl lags behind fft-cuda, achieving only 70% of CUDA's performance. Profiling with nsys (nsys nvprof --print-gpu-trace) reveals that the SYCL code utilizes more registers (255 registers/thread) than its CUDA counterpart (72 registers/thread), leading to register spills as observed with -Xcuda-ptxas --verbose.

Interestingly, the single precision version of the code shows comparable performance between SYCL and CUDA. Although SYCL uses (83 registers/thread) and cuda uses (48 registers/thread), there is no register spill for both.

Seeking recommendations to enhance the performance of the double precision SYCL implementation would be greatly appreciated.

To reproduce

Clone https://github.com/zjin-lcf/HeCBench

For sycl (double precission):

cd HeCBench/src/fft-sycl
make CUDA=yes CUDA_ARCH=sm_80 GCC_TOOLCHAIN=""
Run code: ./main 3 1000

For cuda (double precission):

cd HeCBench/src/fft-cuda
make ARCH=sm_80
Run code: ./main 3 1000

To build the single precision versions, ensure the SINGLE_PRECISION preprocessor directive is defined by editing main.cpp or Makefile.

Here is the result of my run on NVIDIA A100:

For Double Precision

SYCL:

~/HeCBench/src/fft-sycl$ ./main 3 1000
used_bytes=268435456, n_cmplx=16777216
FFT PASS
iFFT PASS
Average kernel execution time 0.00117844 (s)

CUDA:

~/HeCBench/src/fft-cuda$ ./main 3 1000
used_bytes=268435456, n_cmplx=1.67772e+07
FFT PASS
iFFT PASS
Average kernel execution time 0.000832436 (s)

For Single Precision

SYCL:

~/HeCBench/src/fft-sycl$ ./main 3 1000
used_bytes=268435456, n_cmplx=33554432
FFT PASS
iFFT PASS
Average kernel execution time 0.000819632 (s)

CUDA:

~/HeCBench/src/fft-cuda$ ./main 3 1000
used_bytes=268435456, n_cmplx=3.35544e+07
FFT PASS
iFFT PASS
Average kernel execution time 0.000798072 (s)

Environment

OS: Ubuntu 23.10
NVIDIA A100
clang++ --version:

$ clang++ --version
clang version 19.0.0git (https://github.com/intel/llvm 666cf66258363ba1c416d054cab38c85c04fe389)
Target: x86_64-unknown-linux-gnu
Thread model: posix
Build config: +assertions

sycl-ls --verbose:

$ sycl-ls --verbose
[opencl:fpga][opencl:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu][opencl:1] Intel(R) OpenCL,            Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]

Platforms: 3
Platform [#1]:
    Version  : OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
    Name     : Intel(R) FPGA Emulation Platform for OpenCL(TM)
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type              : fpga
        Version           : OpenCL 1.2 
        Name              : Intel(R) FPGA Emulation Device
        Vendor            : Intel(R) Corporation
        Driver            : 2024.17.5.0.08_160000.xmain-hotfix
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : accelerator fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_atomic_host_allocations usm_atomic_shared_allocations ext_oneapi_srgb ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_intel_fpga_task_sequence ext_oneapi_private_alloca
        info::device::sub_group_sizes: 4 8 16 32 64
        Architecture: unknown
Platform [#2]:
    Version  : OpenCL 3.0 LINUX
    Name     : Intel(R) OpenCL
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#1]:
        Type              : cpu
        Version           : OpenCL 3.0 (Build 0)
        Name              :            Intel(R) Xeon(R) CPU @ 2.20GHz
        Vendor            : Intel(R) Corporation
        Driver            : 2024.17.5.0.08_160000.xmain-hotfix
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_private_alloca
        info::device::sub_group_sizes: 4 8 16 32 64
        Architecture: x86_64
Platform [#3]:
    Version  : CUDA 12.4
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type              : gpu
        Version           : 8.0
        Name              : NVIDIA A100-SXM4-40GB
        Vendor            : NVIDIA Corporation
        Driver            : CUDA 12.4
        UUID              : 9031191727913913712018114216122172201180135
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_bfloat16_math_functions ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_widthImages are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
 ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_80
default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]
accelerator_selector()  : fpga, Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
cpu_selector()          : cpu, Intel(R) OpenCL,            Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]
custom_selector(cpu)    : cpu, Intel(R) OpenCL,            Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
custom_selector(acc)    : fpga, Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]

Additional context

No response

Jun 28 '24 19:06 yencal