quda
feature/omptarget
The OpenMP target backend here is still a work in progress, and we welcome any suggestions.
As of now, this port uses a few Intel extensions, contains hacks specific to Intel architectures, and works only on Intel GPUs.
For a quick test, try
cmake \
    -DCMAKE_BUILD_TYPE=RELEASE \
    -DQUDA_TARGET_TYPE=OMPTARGET \
    -DQUDA_DOWNLOAD_USQCD=on \
    -DQUDA_QMP=on \
    -DQUDA_QIO=on \
    -DQUDA_DIRAC_DEFAULT_OFF=on \
    -DQUDA_DIRAC_STAGGERED=on \
    -DQUDA_PRECISION=8 \
    -DQUDA_RECONSTRUCT=4 \
    -DQUDA_FAST_COMPILE_REDUCE=on \
    -DQUDA_FAST_COMPILE_DSLASH=on \
    -DQUDA_BUILD_NATIVE_LAPACK=off \
    -DCMAKE_CXX_COMPILER=mpic++ \
    -DCMAKE_C_COMPILER=mpicc \
    ../quda
Jenkins: Can one of the admins verify this patch?
This is not ready to merge yet. I am just listing it here for interested people.
Great to get this up as a draft PR @jxy 😄
What compilers have you tested this with?
It currently only works with Intel's. More information here: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/openmp-offloading-intro/openmp-compile-and-run.html
So are you using Intel-specific extensions, or is it that other compilers are lacking features? Curious to know what is missing, for example, with NVIDIA's OMP compiler.
There are three reasons.
- QUDA's mapped_malloc currently uses omp_target_alloc_shared, which is an Intel extension.
- Different OpenMP implementations may interpret the specification differently, and I spent most of my effort on Intel's implementation. I haven't tried Nvidia's OMP compiler. The last time I tried LLVM (v12 and v13) on Nvidia GPUs (manually copying memory for mapped_malloc), there were issues with atomics, as well as these two bug reports:
  - llvm/llvm-project#51447
  - llvm/llvm-project#51451
- There are dirty hacks in the code that are definitely waiting for better solutions:
  a. getting the pointer location (required for qudaMemcpyDefault): https://github.com/jxy/quda/blob/6329d5735394736dd27289791cbbd5636bd78098/lib/targets/omptarget/malloc.cpp#L652-L660
  b. a single global address for shared memory per team: https://github.com/jxy/quda/blob/6329d5735394736dd27289791cbbd5636bd78098/include/targets/omptarget/kernel.h#L173-L174
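To make hack (a) above a bit more concrete: a "default" memcpy has to work out whether each pointer lives on the host or on the device before it can pick a copy direction. The sketch below shows one generic way to do that, by having the allocator wrappers record every allocation in a table and looking pointers up by address range. It is only an illustration under assumptions: `PointerRegistry`, `PtrLocation`, `CopyKind`, and `resolve_copy_kind` are hypothetical names, and this is not necessarily what the linked malloc.cpp does.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical types for illustration; not part of QUDA's API.
enum class PtrLocation { Unknown, Host, Device };
enum class CopyKind { HostToHost, HostToDevice, DeviceToHost, DeviceToDevice, Invalid };

class PointerRegistry {
  struct Entry {
    std::size_t bytes;
    PtrLocation loc;
  };
  // Keyed by base address, so a range lookup is a single ordered search.
  std::map<std::uintptr_t, Entry> table_;

public:
  // Called by a (hypothetical) allocator wrapper after each allocation.
  void add(const void *base, std::size_t bytes, PtrLocation loc) {
    table_[reinterpret_cast<std::uintptr_t>(base)] = Entry{bytes, loc};
  }

  // Called by the matching free wrapper.
  void remove(const void *base) {
    table_.erase(reinterpret_cast<std::uintptr_t>(base));
  }

  // Classify any pointer that falls inside a recorded allocation.
  PtrLocation locate(const void *p) const {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    auto it = table_.upper_bound(addr); // first base strictly above addr
    if (it == table_.begin()) return PtrLocation::Unknown;
    --it; // largest recorded base that is <= addr
    if (addr - it->first < it->second.bytes) return it->second.loc;
    return PtrLocation::Unknown;
  }
};

// Turn a "default" copy into an explicit direction.
// Assumption of this sketch: unregistered pointers are rejected rather than
// silently treated as host memory.
inline CopyKind resolve_copy_kind(const PointerRegistry &reg, const void *dst, const void *src) {
  const PtrLocation d = reg.locate(dst);
  const PtrLocation s = reg.locate(src);
  if (d == PtrLocation::Unknown || s == PtrLocation::Unknown) return CopyKind::Invalid;
  if (s == PtrLocation::Host) {
    return d == PtrLocation::Host ? CopyKind::HostToHost : CopyKind::HostToDevice;
  }
  return d == PtrLocation::Host ? CopyKind::DeviceToHost : CopyKind::DeviceToDevice;
}
```

A bookkeeping table like this costs a lookup on every default copy and only knows about pointers that went through the wrappers, which is part of why a runtime-provided query would be the better long-term solution.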