
Query: QUDA Feature-SYCL branch

Open · Soujanyajanga opened this issue 2 years ago · 19 comments

In the QUDA feature/sycl branch, is the SYCL backend fully functional? Does it work on NVIDIA as well, or is it intended only for Intel architectures? Please share the steps to exercise the tests on Intel/NVIDIA platforms.

Soujanyajanga avatar Nov 04 '22 13:11 Soujanyajanga

It is essentially fully functional. Depending on which version of oneAPI and hardware you run with, there may be some issues though. It requires Intel SYCL since it uses some Intel extensions. I've only tried it on Intel hardware, but it might run with the CUDA backend for Intel LLVM as well. Note that there are some changes to follow the SYCL 2020 spec in the upstream Intel LLVM repo which I haven't updated the code for yet. It should work with the current public oneAPI release though. Example build and test commands (which will need updating soon) are below.

export QUDA_TARGET=SYCL
export CXX=dpcpp
o="$o -DCMAKE_INSTALL_PREFIX="
o="$o -DQUDA_FAST_COMPILE_REDUCE=ON"
o="$o -DQUDA_BUILD_SHAREDLIB=OFF"
cmake $o <path to QUDA-SYCL>

make
make test

jcosborn avatar Nov 04 '22 14:11 jcosborn


Was this SYCL backend tested with the Clang compiler?

Soujanyajanga avatar Nov 07 '22 06:11 Soujanyajanga

I've only tested it with dpcpp/icpx.

jcosborn avatar Nov 16 '22 15:11 jcosborn


The following error is observed with the latest code:

[ 12%] Building CXX object lib/CMakeFiles/quda_cpp.dir/inv_ca_gcr.cpp.o
[ 12%] Building CXX object lib/CMakeFiles/quda_cpp.dir/dirac.cpp.o
clang-16: error: unknown argument: '-fhonor-nan-compares'
clang-16: error: unknown argument: '-fhonor-nan-compares'

Soujanyajanga avatar Nov 21 '22 09:11 Soujanyajanga

@jcosborn With the latest Intel LLVM compiler, the following error is observed:

[ 12%] Building CXX object lib/CMakeFiles/quda_cpp.dir/inv_ca_gcr.cpp.o
[ 12%] Building CXX object lib/CMakeFiles/quda_cpp.dir/dirac.cpp.o
clang-16: error: unknown argument: '-fhonor-nan-compares'
clang-16: error: unknown argument: '-fhonor-nan-compares'

This error comes from "quda/lib/targets/sycl/target_sycl.cmake":

if("x${CMAKE_CXX_COMPILER_ID}" STREQUAL "xClang" OR
   "x${CMAKE_CXX_COMPILER_ID}" STREQUAL "xIntelLLVM")
  #target_compile_options(quda INTERFACE -fhonor-nan-compares)
  #target_compile_options(quda PRIVATE -fhonor-nan-compares)
  target_compile_options(quda PUBLIC -fhonor-nan-compares)   <<<<<<<< fails here, as Clang does not support this flag
  target_compile_options(quda PUBLIC -Wno-tautological-constant-compare)
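One conventional way to make this portable would be to probe for the flag instead of keying on the compiler ID. This is only a sketch of that approach (not the fix that was actually applied in the branch):

```cmake
include(CheckCXXCompilerFlag)

# Add -fhonor-nan-compares only when the active compiler actually accepts it;
# plain Clang rejects the flag, while icpx/dpcpp accept it.
check_cxx_compiler_flag(-fhonor-nan-compares QUDA_CXX_HAS_FHONOR_NAN_COMPARES)
if(QUDA_CXX_HAS_FHONOR_NAN_COMPARES)
  target_compile_options(quda PUBLIC -fhonor-nan-compares)
endif()
target_compile_options(quda PUBLIC -Wno-tautological-constant-compare)
```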

Soujanyajanga avatar Nov 25 '22 10:11 Soujanyajanga

Thanks for reporting that. This is fixed now. I have successfully tested it on Intel, but had issues on NVIDIA.

jcosborn avatar Dec 14 '22 18:12 jcosborn

@jcosborn what are the issues on NVIDIA?

maddyscientist avatar Dec 14 '22 18:12 maddyscientist

I get a bunch of errors like:

ptxas error : Entry function '_ZTSZZN4quda6launchINS_9Kernel3DSINS_14dslash_functorENS_18dslash_functor_argINS_19domainWall4DFusedM5ENS_9packShmemELi2ELb0ELb1ELNS_10KernelTypeE5ENS_22DomainWall4DFusedM5ArgIsLi3ELi4EL21QudaReconstructType_s8ELNS_11Dslash5TypeE8EEEEELb0EEESB_EENSt9enable_ifIXntclsr6deviceE14use_kernel_argIT0_EEE11qudaError_tE4typeERKNS_12qudaStream_tERN4sycl3_V18nd_rangeILi3EEERKSE_ENKUlRNSM_7handlerEE_clEST_EUlNSM_7nd_itemILi3EEEE__with_offset' uses too much shared data (0x18000 bytes, 0xc000 max)

jcosborn avatar Dec 14 '22 19:12 jcosborn

Ok, it looks like you (or the SYCL backend) are using static shared memory as opposed to dynamic shared memory: the former has a limit of 48 KiB per thread block, while the latter has a much larger limit (96 KiB on Volta, ~164 KiB on Ampere, ~228 KiB on Hopper). Is this something you can control with SYCL on NVIDIA, or is it out of your hands?

maddyscientist avatar Dec 14 '22 19:12 maddyscientist

I wasn't setting the compute capability before; I'm trying again with sm_80. I'm not sure what else I can change yet.

jcosborn avatar Dec 14 '22 19:12 jcosborn

I thought this line controls the size, no? https://github.com/lattice/quda/blob/aa2ea419ce0f6f78f842f85f40cb2a607944c957/include/targets/sycl/target_device.h#L196

jxy avatar Dec 14 '22 19:12 jxy

@jcosborn the compute capability shouldn't matter here, as the static limit is 48 KiB for all CUDA GPUs since Fermi (2010). The fact that the compiler throws this error indicates that static shared memory is being used as opposed to dynamic, and this is the first red flag here. With dynamic shared memory, the compiler doesn't know what the shared memory per block will be, so it can't throw an error like this.

At least with the CUDA target, with static shared memory, it doesn't surprise me that an excess amount would be requested, as the SharedMemoryCacheHelper with a static allocation will request as much shared memory as is required for the maximum block size (1024 threads).
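To illustrate the distinction, here is a minimal CUDA sketch (invented names, not QUDA code): a static allocation is sized at compile time and capped at 48 KiB, so ptxas can reject it exactly as in the error above, while a dynamic allocation is sized at launch and can exceed 48 KiB after an explicit opt-in.

```cuda
#include <cuda_runtime.h>

// Static shared memory: size fixed at compile time, capped at 48 KiB per
// thread block; ptxas rejects an oversized request at compile time.
__global__ void static_smem_kernel(float *out)
{
  __shared__ float tile[12288];  // exactly 48 KiB of float; any more fails to compile
  tile[threadIdx.x] = threadIdx.x;
  __syncthreads();
  out[threadIdx.x] = tile[threadIdx.x];
}

// Dynamic shared memory: size supplied at launch time, so ptxas cannot
// check it; it may exceed 48 KiB after an explicit opt-in.
__global__ void dynamic_smem_kernel(float *out)
{
  extern __shared__ float dyn_tile[];
  dyn_tile[threadIdx.x] = threadIdx.x;
  __syncthreads();
  out[threadIdx.x] = dyn_tile[threadIdx.x];
}

int main()
{
  float *out;
  cudaMalloc(&out, 1024 * sizeof(float));
  // Opt in to more than 48 KiB of dynamic shared memory (Volta and newer),
  // then pass the byte count as the third launch-configuration parameter.
  cudaFuncSetAttribute(dynamic_smem_kernel,
                       cudaFuncAttributeMaxDynamicSharedMemorySize, 96 * 1024);
  dynamic_smem_kernel<<<1, 1024, 96 * 1024>>>(out);
  cudaDeviceSynchronize();
  cudaFree(out);
  return 0;
}
```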

maddyscientist avatar Dec 14 '22 19:12 maddyscientist

Yes, it seems it will only use static shared memory: https://github.com/intel/llvm/pull/3329

I'll see what I can get to compile now, and look into setting a limit for it.

jcosborn avatar Dec 14 '22 20:12 jcosborn

I also have several issues compiling this branch of QUDA, as well as some questions.

Questions:

  1. Do you assume the user compiles this software using dpcpp, in particular the one from oneAPI 2022.1.0? I ask because some files include sycl/ext/oneapi/experimental/builtins.hpp, which can be found in the 2022 version of the oneAPI distribution but not in version 2021.2.0.
  2. What command line options did you use when installing oneAPI? I am wondering because lib/targets/sycl/blas_lapack_mkl.cpp includes the file oneapi/mkl.hpp when QUDA_NATIVE_LAPACK is set to True, which is the default. I assume this is part of oneAPI as the path contains oneapi. However, I was not able to locate this file in my oneAPI distribution.

There are some error messages when I try to compile QUDA from this branch.

  1. lib/targets/sycl/device.cpp:105:91: error: 'max_work_item_sizes' does not name a template but is followed by template arguments. max_work_item_sizes is set in include/sycl/CL/sycl/info/info_desc.hpp from oneAPI to be max_work_item_sizes = CL_DEVICE_MAX_WORK_ITEM_SIZES. In turn, CL_DEVICE_MAX_WORK_ITEM_SIZES is set in include/sycl/CL/cl.h using #define. I'm not sure why I got this error. Is this due to incorrect installation of oneAPI or some missing command line argument for cmake when compiling QUDA?
  2. There are other errors like the one above, such as lib/targets/sycl/device.cpp:81:16: error: no matching conversion for functional-style cast from 'int (*)(const sycl::device &)' to 'sycl::platform'. These seem to suggest that I am using a SYCL version or implementation different from the one assumed by this branch of QUDA.
The list of similar errors:
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:81:16: error: no matching conversion for functional-style cast from 'int (*)(const sycl::device &)' to 'sycl::platform'
    auto p = sycl::platform(mySelector);
             ^~~~~~~~~~~~~~~~~~~~~~~~~
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:56:12: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'const sycl::device_selector' for 1st argument
explicit platform(const device_selector &DeviceSelector);
         ^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:58:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'const sycl::platform' for 1st argument
platform(const platform &rhs) = default;
^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:60:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'sycl::platform' for 1st argument
platform(platform &&rhs) = default;
^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:148:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'std::shared_ptr<detail::platform_impl>' for 1st argument
platform(std::shared_ptr<detail::platform_impl> impl) : impl(impl) {}
^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:37:3: note: candidate constructor not viable: requires 0 arguments, but 1 was provided
platform();
^
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:105:110: error: invalid operands to binary expression ('sycl::info::device' and 'int')
    printfQuda("  Max work item sizes: %s\n", str(myDevice.get_info<sycl::info::device::max_work_item_sizes<3>>()).c_str());
                                                                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
/cyclamen/home/syamamoto/src/quda_sycl/lib/../include/util_quda.h:91:29: note: expanded from macro 'printfQuda'
sprintf(getPrintBuffer(), __VA_ARGS__);              \
                          ^~~~~~~~~~~
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:105:91: error: 'max_work_item_sizes' does not name a template but is followed by template arguments
    printfQuda("  Max work item sizes: %s\n", str(myDevice.get_info<sycl::info::device::max_work_item_sizes<3>>()).c_str());
                                                                                        ^                  ~~~~
/cyclamen/home/syamamoto/src/quda_sycl/lib/../include/util_quda.h:91:29: note: expanded from macro 'printfQuda'
sprintf(getPrintBuffer(), __VA_ARGS__);              \
                          ^~~~~~~~~~~
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/info/info_desc.hpp:55:3: note: non-template declaration found by name lookup
max_work_item_sizes = CL_DEVICE_MAX_WORK_ITEM_SIZES,
^
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:146:16: error: no matching conversion for functional-style cast from 'int (*)(const sycl::device &)' to 'sycl::platform'
    auto p = sycl::platform(mySelector);
             ^~~~~~~~~~~~~~~~~~~~~~~~~
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:56:12: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'const sycl::device_selector' for 1st argument
explicit platform(const device_selector &DeviceSelector);
         ^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:58:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'const sycl::platform' for 1st argument
platform(const platform &rhs) = default;
^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:60:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'sycl::platform' for 1st argument
platform(platform &&rhs) = default;
^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:148:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'std::shared_ptr<detail::platform_impl>' for 1st argument
platform(std::shared_ptr<detail::platform_impl> impl) : impl(impl) {}
^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:37:3: note: candidate constructor not viable: requires 0 arguments, but 1 was provided
platform();
^
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:154:16: error: no matching conversion for functional-style cast from 'int (*)(const sycl::device &)' to 'sycl::platform'
    auto p = sycl::platform(mySelector);
             ^~~~~~~~~~~~~~~~~~~~~~~~~
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:56:12: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'const sycl::device_selector' for 1st argument
explicit platform(const device_selector &DeviceSelector);
         ^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:58:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'const sycl::platform' for 1st argument
platform(const platform &rhs) = default;
^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:60:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'sycl::platform' for 1st argument
platform(platform &&rhs) = default;
^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:148:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'std::shared_ptr<detail::platform_impl>' for 1st argument
platform(std::shared_ptr<detail::platform_impl> impl) : impl(impl) {}
^
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:37:3: note: candidate constructor not viable: requires 0 arguments, but 1 was provided
platform();
^
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:161:17: error: no namespace named 'device' in namespace 'sycl::info'; did you mean simply 'device'?
      namespace id = sycl::info::device;
                     ^~~~~~~~~~~~~~~~~~
                     device
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:59:13: note: namespace 'device' defined here
namespace device
          ^
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:164:81: error: no member named 'name' in namespace 'quda::device'
      printfQuda("%d - name:                    %s\n", device, d.get_info<id::name>().c_str());
                                                                          ~~~~^
/cyclamen/home/syamamoto/src/quda_sycl/lib/../include/util_quda.h:91:29: note: expanded from macro 'printfQuda'
sprintf(getPrintBuffer(), __VA_ARGS__);              \
                          ^~~~~~~~~~~
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:364:75: error: invalid operands to binary expression ('sycl::info::device' and 'int')
    auto val = myDevice.get_info<sycl::info::device::max_work_item_sizes<3>>();
                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:364:56: error: 'max_work_item

sy3394 avatar Feb 03 '23 16:02 sy3394

Yes, it generally requires the latest version of oneAPI (or intel-llvm). I'm currently testing with 2023.0.0. The issues you are seeing are due to differences in the older version of oneAPI.

jcosborn avatar Feb 03 '23 16:02 jcosborn

Thank you for your prompt reply. I will install the new version and try it out.

Meanwhile, I have another simple question. I am trying to compile QUDA targeting SYCL because I want to use QUDA in an environment possibly without GPUs, for testing purposes. Performance is not my main concern; I just need to run QUDA without GPUs. I assume this branch of QUDA works on CPUs. Am I correct?

sy3394 avatar Feb 06 '23 13:02 sy3394

Yes, it works with the opencl:cpu backend, though performance isn't very good.
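For reference, a sketch of forcing the CPU backend at run time. The environment variable names here are oneAPI conventions of that era, not anything QUDA-specific: SYCL_DEVICE_FILTER is the older spelling, ONEAPI_DEVICE_SELECTOR the newer one.

```shell
# Select the OpenCL CPU backend when no GPU is present (oneAPI runtime):
export SYCL_DEVICE_FILTER=opencl:cpu       # pre-2023 oneAPI releases
export ONEAPI_DEVICE_SELECTOR=opencl:cpu   # newer oneAPI releases
# then run a QUDA test binary, e.g.:
# ./tests/dslash_test
```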

jcosborn avatar Feb 06 '23 19:02 jcosborn

@jcosborn I have tried the following compiling commands, but have encountered some errors when linking.


export QUDA_TARGET=SYCL
export CXX=dpcpp
o="$o -DCMAKE_INSTALL_PREFIX="
o="$o -DQUDA_FAST_COMPILE_REDUCE=ON"
o="$o -DQUDA_BUILD_SHAREDLIB=OFF"
cmake $o

make
make test

The following errors occur at linking. The compiler I used is icpx 2023.2.4, together with oneMKL 2023.0.0.

[ 82%] Linking CXX executable gauge_alg_test
icpx: warning: use of 'dpcpp' is deprecated and will be removed in a future release. Use 'icpx -fsycl' [-Wdeprecated]
/usr/bin/ld: ../lib/libquda.a(timer.cpp.o): unable to initialize decompress status for section .debug_loc
/usr/bin/ld: ../lib/libquda.a(timer.cpp.o): unable to initialize decompress status for section .debug_loc
../lib/libquda.a: error adding symbols: File format not recognized
icpx: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [tests/CMakeFiles/gauge_alg_test.dir/build.make:99: tests/gauge_alg_test] Error 1
make[1]: *** [CMakeFiles/Makefile2:1088: tests/CMakeFiles/gauge_alg_test.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

li12242 avatar Mar 16 '24 09:03 li12242

Sorry for the mistakes. I updated the binutils tools and the errors disappeared.
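For anyone hitting the same thing: the "unable to initialize decompress status for section .debug_loc" message usually means the system linker is too old to decompress compressed debug sections. A quick first check (a sketch; the exact minimum binutils version depends on the distribution):

```shell
# Show which ld is being picked up and how old it is; an outdated binutils
# cannot handle compressed .debug_* sections emitted by newer compilers.
command -v ld
ld --version | head -n 1
```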

li12242 avatar Mar 18 '24 03:03 li12242