quda feature/omptarget

The OpenMP target backend here is still a work in progress. We welcome any suggestions.

As of now, this port uses a few Intel extensions, contains hacks specific to Intel architectures, and only works on Intel GPUs.

For a quick test, try

cmake\
        -DCMAKE_BUILD_TYPE=RELEASE\
        -DQUDA_TARGET_TYPE=OMPTARGET\
        -DQUDA_DOWNLOAD_USQCD=on\
        -DQUDA_QMP=on\
        -DQUDA_QIO=on\
        -DQUDA_DIRAC_DEFAULT_OFF=on\
        -DQUDA_DIRAC_STAGGERED=on\
        -DQUDA_PRECISION=8\
        -DQUDA_RECONSTRUCT=4\
        -DQUDA_FAST_COMPILE_REDUCE=on\
        -DQUDA_FAST_COMPILE_DSLASH=on\
        -DQUDA_BUILD_NATIVE_LAPACK=off\
        -DCMAKE_CXX_COMPILER=mpic++\
        -DCMAKE_C_COMPILER=mpicc\
        ../quda

Apr 20 '22 21:04 jxy

Jenkins: Can one of the admins verify this patch?

Apr 20 '22 21:04 mathiaswagner

This is not ready to merge yet. I am just listing it here for interested people.

Apr 20 '22 22:04 jxy

Great to get this up as a draft PR @jxy 😄

What compilers have you tested this with?

Apr 21 '22 17:04 maddyscientist

It currently only works with Intel's. More information here: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/openmp-offloading-intro/openmp-compile-and-run.html

Apr 21 '22 17:04 jxy

So are you using Intel-specific extensions, or is it that other compilers are lacking features? I'm curious to know what is missing, for example, with NVIDIA's OMP compiler.

Apr 21 '22 18:04 maddyscientist

There are three reasons.

1. QUDA's mapped_malloc currently uses omp_target_alloc_shared, which is an Intel extension.
2. Different OpenMP implementations may interpret the specification differently, and I spent most of my effort on Intel's implementation. I haven't tried Nvidia's OMP compiler. When I last tried LLVM (v12 and v13) on Nvidia GPUs (manually copying memory for mapped_malloc), there were issues with atomics, as well as these two bug reports:
   - llvm/llvm-project#51447
   - llvm/llvm-project#51451
3. There are dirty hacks in the code that definitely need better solutions:
   a. getting the pointer location (required for qudaMemcpyDefault): https://github.com/jxy/quda/blob/6329d5735394736dd27289791cbbd5636bd78098/lib/targets/omptarget/malloc.cpp#L652-L660
   b. a single global address for shared memory per team: https://github.com/jxy/quda/blob/6329d5735394736dd27289791cbbd5636bd78098/include/targets/omptarget/kernel.h#L173-L174
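
For the pointer-location query in hack (a), one portable alternative might be to track allocations on the host and look pointers up by address range. The sketch below is only an illustration of that idea, not what the linked malloc.cpp does: `AllocationRegistry`, `PtrLocation`, and their methods are hypothetical names, and it assumes host and device allocations occupy disjoint address ranges.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical names throughout; none of this is QUDA's actual API.
enum class PtrLocation { Host, Device, Shared, Unknown };

// Tracks allocations by address range so that a copy routine with
// "default" semantics can decide the direction from the pointers alone.
// Assumption: host and device ranges never overlap numerically.
class AllocationRegistry {
public:
  // Record an allocation of `bytes` bytes starting at `ptr`.
  void record(const void *ptr, std::size_t bytes, PtrLocation loc) {
    ranges_[reinterpret_cast<std::uintptr_t>(ptr)] = Range{bytes, loc};
  }

  // Forget an allocation when it is freed.
  void erase(const void *ptr) {
    ranges_.erase(reinterpret_cast<std::uintptr_t>(ptr));
  }

  // Classify a pointer, including one that points into the middle
  // of a recorded allocation.
  PtrLocation locate(const void *ptr) const {
    const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    auto it = ranges_.upper_bound(addr); // first range starting after addr
    if (it == ranges_.begin()) return PtrLocation::Unknown;
    --it; // last range starting at or before addr
    if (addr - it->first < it->second.bytes) return it->second.loc;
    return PtrLocation::Unknown;
  }

private:
  struct Range {
    std::size_t bytes;
    PtrLocation loc;
  };
  std::map<std::uintptr_t, Range> ranges_;
};
```

Under this scheme, the allocation wrappers would call `record` and `erase`, and the copy routine would call `locate` on both source and destination. Pointers the registry has never seen come back as `Unknown`, which a caller could treat as plain host memory.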

Apr 21 '22 19:04 jxy
