Reduce synchronization needed
See the figure in #160. Currently every asynchronous transfer and kernel_dispatch within a target offload is followed immediately by its own synchronization (hsa_signal_wait_scacquire). It would be nice to optimize this by enqueuing the H2D copies, the kernel_dispatch, the D2H copies, and the synchronization into an in-order queue, so that only one synchronization is needed. This optimization is already available in the upstream offload library with the CUDA plugin.
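For illustration only (the array name and size below are made up), a target region of the following shape generates one H2D copy, one kernel_dispatch, and one D2H copy inside the runtime; today each of them is followed by its own wait, whereas in an in-order queue a single wait after the D2H copy would suffice:

#include <stdlib.h>

int main(void) {
  const int n = 1 << 20;
  double *x = malloc(n * sizeof(double));
  for (int i = 0; i < n; ++i) x[i] = (double)i;

  /* Runtime work for this region: H2D copy of x, kernel_dispatch, D2H copy of x.
     Currently: three synchronizations; with an in-order queue: one at the end. */
  #pragma omp target map(tofrom: x[0:n])
  for (int i = 0; i < n; ++i)
    x[i] *= 2.0;

  free(x);
  return 0;
}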
Async is not yet used by the amdgpu backend. I'm expecting a performance improvement from moving to it.
@JonChesterfield What is the status of async transfers?
Wondering what the status and possible timeline here might be? It would help reduce overheads and improve performance.
@ye-luo and @prckent I have a prototype implementation of the behavior you mention. I'd like to run some performance stress tests to make sure the patch does not introduce overhead when there is no chance of benefiting from asynchronous/delayed execution. Do you have a specific test or proxy-app configuration that benefits from the suggested behavior?
@carlobertolli I got confused. Weren't you asking for performance tests to show the benefit of your optimization prototype?
Have you seen a huge reduction in the trace timeline with the following example?
int a;
#pragma omp target map(tofrom: a)
{ a = a * 2; }
Since the example mostly exercises overhead, I assume we should see a huge gain.
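For reference, a self-contained version of that micro-example with a repeat loop (the iteration count is arbitrary), so that the per-offload overhead dominates the measurement:

#include <omp.h>
#include <stdio.h>

int main(void) {
  int a = 1;
  const int iters = 1000;               /* arbitrary repeat count */
  double t0 = omp_get_wtime();
  for (int i = 0; i < iters; ++i) {
    #pragma omp target map(tofrom: a)
    { a = a * 2; }
    a = 1;                              /* reset; the device work is deliberately negligible */
  }
  double t1 = omp_get_wtime();
  printf("average time per offload: %g us\n", (t1 - t0) * 1e6 / iters);
  return 0;
}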
More than a reduction, there is a shift in latency reporting. In the new patch, as you guessed, we enqueue the wait for the host-to-device memory copy of 'a' in the same queue where we enqueue the launch of the kernel (technically, we enqueue an HSA barrier-AND packet). Because there is only one memory copy, and for a small amount of data, you are merely overlapping that copy with the process of launching a kernel (determining the correct launch parameters, getting an HSA queue, etc.). There is a performance gain, but I would not classify it as huge.
The performance gain is much larger for target regions that copy in a large number of big arrays, because those copies are not waited for sequentially. I asked because I thought you had a specific "real-world" application with a target region that looks like this.
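Something like the sketch below (array names and sizes invented for illustration) is the shape of region that should benefit most: several large host-to-device copies that can all be enqueued back to back, with a single wait after the final device-to-host copy:

#include <stdlib.h>

int main(void) {
  const size_t n = 1 << 24;                      /* ~16M doubles per array, illustrative */
  double *x = malloc(n * sizeof(double));
  double *y = malloc(n * sizeof(double));
  double *z = malloc(n * sizeof(double));
  for (size_t i = 0; i < n; ++i) { x[i] = (double)i; y[i] = 2.0 * i; z[i] = 0.0; }

  /* Three large transfers (x, y in; z out) plus the kernel_dispatch; none of the
     H2D copies needs an individual wait when they sit in one in-order queue. */
  #pragma omp target teams distribute parallel for map(to: x[0:n], y[0:n]) map(from: z[0:n])
  for (size_t i = 0; i < n; ++i)
    z[i] = x[i] + y[i];

  free(x); free(y); free(z);
  return 0;
}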
@carlobertolli Here is a benchmark from the QMCPACK performance tests that you may try:
performance-NiO-a64-e768-batched_driver-w16-DU32-1-4 test case
Make a real+mixed precision build. Here is the recipe for AOMP 14.0_1
cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF \
-DQMC_OFFLOAD_ROCM_WORKAROUND_BRANCH_IN_PARALLEL=OFF \
-DENABLE_OFFLOAD=ON -DOFFLOAD_ARCH=gfx906 -DQMC_MIXED_PRECISION=ON \
-DQMC_DATA=<your_data_folder_for_benchmark> \
<QMCPACK_source_folder>
make -j32
Make sure you run the QMCPACK deterministic tests (ctest -R deter -j32 --output-on-failure) before running benchmarks. Some tests may fail due to numerical errors. Email me if you want me to assess the results.
Here are benchmark runs from last night: https://cdash.qmcpack.org/CDash/testSummary.php?project=1&name=performance-NiO-a64-e768-batched_driver-w16-DU32-1-4&date=2022-01-30 ROCm-Offload-HIP-Real-Mixed-Release takes twice as long on Radeon VII as ClangDev-Offload-Real-Mixed-Release on an RTX 3060Ti.
The build recipe I provided above is different from the nightly runs. The above recipe doesn't include the HIP part and is thus heavier on OpenMP offload.
To enable HIP and the roc accelerated libraries (this also enables pinned host memory), add:
-DENABLE_CUDA=ON -DQMC_CUDA2HIP=ON -DHIP_ARCH=gfx906
You might run into CMake trouble due to the messy CMake in the ROCm 4.5 release.
To stress the compute, the input file parameters need to be changed. Without changing the input files, this test mostly stresses the runtime, which is what you are looking at.
Email me if you need additional help.
In this offload region https://github.com/ye-luo/miniqmc/blob/c0b6c89746e424f5b198bd63ad63dd5bb5cd12cf/src/QMCWaveFunctions/einspline_spo_omp.cpp#L413
I still see two synchronizations (hsa_signal_wait_scacquire) being used.
However, with the CUDA plugin, only one synchronization is needed.
@ye-luo I believe we can close this one? The focus of development for this feature is shifting to upstreaming into LLVM trunk.
I'm thinking of leaving it open here. Once we improve the performance upstream and propagate the change back to AOMP, we can close it here.
This is fixed by adopting plugin-nextgen.