
Reduce synchronization needed

Open ye-luo opened this issue 4 years ago • 11 comments

See the figure in #160. Currently, every asynchronous transfer and kernel dispatch within a target offload region is followed immediately by its own synchronization (signal_wait_scacquire). It would be nice to optimize this by enqueuing the H2D copies, the kernel dispatch, the D2H copies, and the synchronization into an in-order queue, so that only one synchronization is needed. This optimization is already available in the upstream offload library with the CUDA plugin.
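
For concreteness, here is a minimal sketch of the desired pattern, written against a HIP stream as the in-order queue rather than the offload runtime's internal HSA queues; the kernel, sizes, and buffer names are illustrative assumptions, not code from the library:

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(double *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] *= 2.0;
}

int main() {
  const int n = 1 << 20;
  double *h_a = new double[n];
  for (int i = 0; i < n; ++i) h_a[i] = 1.0;

  double *d_a = nullptr;
  hipMalloc(&d_a, n * sizeof(double));

  hipStream_t stream;  // the in-order queue
  hipStreamCreate(&stream);

  // Enqueue H2D, kernel dispatch, and D2H back-to-back; the stream
  // guarantees ordering, so no host-side wait is needed in between.
  hipMemcpyAsync(d_a, h_a, n * sizeof(double), hipMemcpyHostToDevice, stream);
  hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, stream, d_a, n);
  hipMemcpyAsync(h_a, d_a, n * sizeof(double), hipMemcpyDeviceToHost, stream);

  // One synchronization for the whole "target region".
  hipStreamSynchronize(stream);

  printf("h_a[0] = %f\n", h_a[0]);  // expect 2.0
  hipStreamDestroy(stream);
  hipFree(d_a);
  delete[] h_a;
  return 0;
}

The CUDA plugin in the upstream library batches its work this way on a CUDA stream; the ask here is for the amdgpu plugin to do the equivalent with its HSA queues.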

ye-luo avatar Oct 08 '20 23:10 ye-luo

Async is not yet used by the amdgpu backend. I'm expecting a performance improvement from moving to it.

JonChesterfield avatar Oct 09 '20 00:10 JonChesterfield

@JonChesterfield What is the status on async transfer?

gregrodgers avatar Apr 20 '21 13:04 gregrodgers

Wondering what the status/possible timelines here might be? It would help reduce overheads and improve performance.

prckent avatar Nov 19 '21 20:11 prckent

@ye-luo and @prckent I have a prototype implementation of the behavior you mention. I'd like to run some performance stress tests to make sure the patch does not add overhead when there is no chance of benefiting from asynchronous/delayed execution. Do you have a specific test or proxy-app configuration that benefits from the suggested behavior?

carlobertolli avatar Jan 28 '22 15:01 carlobertolli

@carlobertolli I got confused. Weren't you asking for performance tests to show the benefit of your optimization prototype?

ye-luo avatar Jan 28 '22 16:01 ye-luo

Have you seen a huge reduction in the trace timeline with the following example?

int a;
#pragma omp target map(tofrom: a)
{ a = a * 2; }

Since the example mostly exercises overhead, I assume we should see a huge gain.

ye-luo avatar Jan 28 '22 16:01 ye-luo

More than a reduction, there is a shift in where the latency is reported. In the new patch, as you guessed, we enqueue the wait for the host-to-device memcopy of 'a' in the same queue where we enqueue the kernel launch (technically, we enqueue an HSA barrier-AND packet). Because there is only one memory copy and it moves a small amount of data, you are overlapping that copy with the process of launching a kernel (determining the correct launch parameters, getting an HSA queue, etc.). There is a performance gain, but I would not classify it as huge.
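
Roughly, enqueuing that dependency at the HSA level looks like the sketch below. This is an assumption-laden illustration rather than the actual patch: error handling, queue-full checks, and the plugin's own data structures are omitted, and the function name enqueue_barrier_and and the copy_done signal are made up for the example.

#include <hsa/hsa.h>
#include <cstdint>
#include <cstring>

// Enqueue a barrier-AND packet on `queue` so that packets submitted after
// it (e.g. the kernel dispatch) wait until `copy_done`, the completion
// signal of the asynchronous H2D copy, drops to zero.
void enqueue_barrier_and(hsa_queue_t *queue, hsa_signal_t copy_done) {
  // Reserve a slot in the queue's ring buffer.
  uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
  hsa_barrier_and_packet_t *slot =
      &((hsa_barrier_and_packet_t *)queue->base_address)[index & (queue->size - 1)];

  // Fill the packet body first; the header is published last so the packet
  // processor never sees a half-written packet.
  memset((char *)slot + sizeof(slot->header), 0,
         sizeof(*slot) - sizeof(slot->header));
  slot->dep_signal[0] = copy_done;

  uint16_t header =
      (HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
  __atomic_store_n(&slot->header, header, __ATOMIC_RELEASE);

  // Ring the doorbell to tell the packet processor there is new work.
  hsa_signal_store_screlease(queue->doorbell_signal, index);
}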

The performance gain is much larger for target regions that copy in many big arrays, because those copies are no longer waited for sequentially. I asked because I thought you had a specific "real-world" application with a target region that looked like this.
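
As a made-up illustration (not taken from QMCPACK or miniqmc) of the kind of region where this pays off, the target region below maps in several large arrays, so the runtime can enqueue all of the copies and the kernel back-to-back and wait once at the end instead of once per operation:

#include <cstdio>

int main() {
  const int n = 1 << 24;
  double *a = new double[n];
  double *b = new double[n];
  double *c = new double[n];
  for (int i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

  // Three H2D copies (a, b, c), one kernel dispatch, and one D2H copy (c):
  // with a wait after every operation that is five synchronizations;
  // with an in-order queue it is one, taken at the end of the region.
  #pragma omp target teams distribute parallel for \
      map(to: a[0:n], b[0:n]) map(tofrom: c[0:n])
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];

  printf("c[0] = %f\n", c[0]);  // expect 3.0
  delete[] a;
  delete[] b;
  delete[] c;
  return 0;
}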

carlobertolli avatar Jan 28 '22 16:01 carlobertolli

@carlobertolli Here is a benchmark you may try from the QMCPACK performance tests: the performance-NiO-a64-e768-batched_driver-w16-DU32-1-4 test case. Make a real + mixed-precision build. Here is the recipe for AOMP 14.0_1:

cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF  \
    -DQMC_OFFLOAD_ROCM_WORKAROUND_BRANCH_IN_PARALLEL=OFF \
    -DENABLE_OFFLOAD=ON -DOFFLOAD_ARCH=gfx906 -DQMC_MIXED_PRECISION=ON \
    -DQMC_DATA=<your_data_folder_for_benchmark> \
    <QMCPACK_source_folder>
make -j32

Make sure you run the QMCPACK deterministic tests (ctest -R deter -j32 --output-on-failure) before running benchmarks. Some tests may fail due to numerical errors. Email me if you want me to assess the results.

Here are benchmark runs from last night (https://cdash.qmcpack.org/CDash/testSummary.php?project=1&name=performance-NiO-a64-e768-batched_driver-w16-DU32-1-4&date=2022-01-30). ROCm-Offload-HIP-Real-Mixed-Release on a Radeon VII takes twice as long as ClangDev-Offload-Real-Mixed-Release on an RTX 3060 Ti.

The build recipe I provided above is different from the nightly runs: it doesn't enable the HIP part and is therefore heavier on OpenMP offload.

To enable HIP and the ROCm accelerated libraries (which also enables pinned host memory), add:

-DENABLE_CUDA=ON -DQMC_CUDA2HIP=ON -DHIP_ARCH=gfx906

You might run into CMake trouble due to the messy CMake in the ROCm 4.5 release.

To stress the compute, the input file parameters need to be changed. Without changing the input files, this test mostly stresses the runtime, which is what you are looking at.

Email me if you need additional help.

ye-luo avatar Jan 31 '22 00:01 ye-luo

From this offload region: https://github.com/ye-luo/miniqmc/blob/c0b6c89746e424f5b198bd63ad63dd5bb5cd12cf/src/QMCWaveFunctions/einspline_spo_omp.cpp#L413 I still see two synchronizations (signal_wait_scacquire) being used (screenshot from 2022-02-25 21-51-11). However, with the CUDA plugin, only one synchronization is needed (screenshot from 2022-02-25 21-49-20).

ye-luo avatar Feb 26 '22 03:02 ye-luo

@ye-luo I believe we can close this one? The focus of development for this feature is shifting to upstreaming it to LLVM trunk.

carlobertolli avatar Apr 20 '22 18:04 carlobertolli

I'm thinking of leaving it open here. Once we improve the performance upstream and propagate the change back to AOMP, we can close it here.

ye-luo avatar Apr 20 '22 18:04 ye-luo

This is fixed by adopting plugin-nextgen.

ye-luo avatar Apr 28 '23 15:04 ye-luo