feature/omptarget

Open jxy opened this issue 2 years ago • 6 comments

The OpenMP target backend here is still a work in progress. We welcome any suggestions.

As of now, this port uses a few Intel extensions, contains hacks specific to Intel architectures, and works only on Intel GPUs.

For a quick test, try

cmake\
        -DCMAKE_BUILD_TYPE=RELEASE\
        -DQUDA_TARGET_TYPE=OMPTARGET\
        -DQUDA_DOWNLOAD_USQCD=on\
        -DQUDA_QMP=on\
        -DQUDA_QIO=on\
        -DQUDA_DIRAC_DEFAULT_OFF=on\
        -DQUDA_DIRAC_STAGGERED=on\
        -DQUDA_PRECISION=8\
        -DQUDA_RECONSTRUCT=4\
        -DQUDA_FAST_COMPILE_REDUCE=on\
        -DQUDA_FAST_COMPILE_DSLASH=on\
        -DQUDA_BUILD_NATIVE_LAPACK=off\
        -DCMAKE_CXX_COMPILER=mpic++\
        -DCMAKE_C_COMPILER=mpicc\
        ../quda
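
Before configuring, it may help to confirm that the compiler behind `mpic++` actually offloads OpenMP target regions. The following is a minimal sketch and not part of QUDA: the function names `count_offload_devices` and `offload_sum` are hypothetical, and only standard OpenMP constructs are used. If the file is compiled without OpenMP enabled, the pragma is ignored and the loop simply runs on the host.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: number of non-host devices the OpenMP runtime can see.
// Returns 0 when the file is compiled without OpenMP support.
int count_offload_devices() {
#ifdef _OPENMP
  return omp_get_num_devices();
#else
  return 0;
#endif
}

// Hypothetical helper: sum 0..n-1 inside a target region.
// With no usable device the runtime normally falls back to host execution.
long offload_sum(int n) {
  assert(n >= 0);
  long sum = 0;
#pragma omp target teams distribute parallel for reduction(+ : sum) map(tofrom : sum)
  for (int i = 0; i < n; ++i) sum += i;
  return sum;
}
```

A device count of zero suggests the runtime sees no GPU. Since this port is reported to work only on Intel GPUs, the QUDA build would probably not be usable in that case.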

jxy avatar Apr 20 '22 21:04 jxy

Jenkins: Can one of the admins verify this patch?

mathiaswagner avatar Apr 20 '22 21:04 mathiaswagner

This is not ready to merge yet. I am just listing it here for interested people.

jxy avatar Apr 20 '22 22:04 jxy

Great to get this up as a draft PR @jxy 😄

What compilers have you tested this with?

maddyscientist avatar Apr 21 '22 17:04 maddyscientist

It currently only works with Intel's. More information here: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/openmp-offloading-intro/openmp-compile-and-run.html

jxy avatar Apr 21 '22 17:04 jxy

So are you using Intel-specific extensions, or is it that other compilers are lacking features? I'm curious to know what is missing, for example, with NVIDIA's OMP compiler.

maddyscientist avatar Apr 21 '22 18:04 maddyscientist

There are three reasons.

  1. QUDA's mapped_malloc currently uses omp_target_alloc_shared, which is an Intel extension.
  2. Different OpenMP implementations may interpret the specification differently, and I spent most of my effort on Intel's implementation. I haven't tried NVIDIA's OMP compiler. When I last tried LLVM (v12 and v13) on NVIDIA GPUs (manually copying memory for mapped_malloc), there were issues with atomics, as well as these two bug reports:
    • llvm/llvm-project#51447
    • llvm/llvm-project#51451
  3. There are dirty hacks in the code that are definitely waiting for better solutions:
    • getting the pointer location (required for qudaMemcpyDefault): https://github.com/jxy/quda/blob/6329d5735394736dd27289791cbbd5636bd78098/lib/targets/omptarget/malloc.cpp#L652-L660
    • a single global address for shared memory per team: https://github.com/jxy/quda/blob/6329d5735394736dd27289791cbbd5636bd78098/include/targets/omptarget/kernel.h#L173-L174
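
One portable way to answer the "where does this pointer live" question from the first hack is to record every allocation in a registry and look pointers up by address range. The sketch below illustrates that idea only. It is not the code at the linked lines, and `PtrLocation` and `PointerRegistry` are hypothetical names. It assumes host and device allocations never overlap numerically, which may not hold without unified addressing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical tag for where an allocation lives.
enum class PtrLocation { Unknown, Host, Device, Shared };

// Hypothetical registry: maps allocation base addresses to size and location.
class PointerRegistry {
  struct Range {
    std::size_t bytes;
    PtrLocation loc;
  };
  std::map<std::uintptr_t, Range> ranges_;

public:
  // Call right after a successful allocation.
  void record(const void *base, std::size_t bytes, PtrLocation loc) {
    assert(base != nullptr && bytes > 0);
    ranges_[reinterpret_cast<std::uintptr_t>(base)] = Range{bytes, loc};
  }

  // Call right before the matching free.
  void forget(const void *base) {
    ranges_.erase(reinterpret_cast<std::uintptr_t>(base));
  }

  // Look up any pointer, including one into the middle of an allocation.
  PtrLocation locate(const void *p) const {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    auto it = ranges_.upper_bound(addr);  // first range starting after addr
    if (it == ranges_.begin()) return PtrLocation::Unknown;
    --it;  // last range starting at or before addr
    return (addr - it->first < it->second.bytes) ? it->second.loc
                                                 : PtrLocation::Unknown;
  }
};
```

A default-direction copy routine could call `locate` on both source and destination to choose the copy direction, and treat `Unknown` as ordinary host memory that was not allocated through the registry.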

jxy avatar Apr 21 '22 19:04 jxy