openmm
openmm copied to clipboard
One loop hangs Amoeba on Windows HIP
After hundreds of display driver resets, I finally tracked down what hangs in solveDIISMatrix
when iteration > 0
on Windows HIP.
https://github.com/openmm/openmm/blob/065e34ab56259d56dae81655f14fcfcbdb077788/plugins/amoeba/platforms/common/src/kernels/multipoleInducedField.cc#L765-L769
The issue is resolved by adding #pragma unroll 1
. This hang only happens on Windows HIP. Ubuntu 20.04 compiles kernels using Clang 15.x and Windows HIP bundles Clang 17.0.0 and could affect how the kernel is optimized by /O3
. Since this is OS dependent and the code is in common, I'm not sure what the best way to implement the fix would be.
RX 6600 on Windows HIP. The massive amoebagk improvement is inline with previously reported Linux HIP results.
Benchmark | OpenCL ns/day | HIP ns/day | Ratio |
---|---|---|---|
amoebagk | 1.38817 | 15.4375 | 1112.08% |
amoebapme | 3.61985 | 6.11883 | 169.04% |
That sounds like a compiler bug. It's a really simple loop, and unrolling should never affect correctness. We should probably notify AMD about it.
@ex-rzr do you have any insight about this?
I spent some time trying to build and run OpenMM-HIP on Windows with an older pre-release HIP SDK version a few months ago. I didn't finish my attempt then because it was my personal initiative (OpenMM-HIP is supported by AMD developers currently) and needed to switch to another project. And I definitely remember some issues with AMOEBA but I'm not sure whether they were in tests or benchmarks and what kind of issues (hangs or just failures).
@bdenhollander I see that you've done a lot of work on supporting Windows: https://github.com/bdenhollander/openmm-hip/commits/windows-compatibility I'll try to find time next week and build and run your branch on our GPU. Unfortunately it's also RDNA. If it's indeed a compiler bug then trying it on another architecture may provide some additional info.
Good job with narrowing it down, btw! Compiler issues are incredibly hard to debug because they often disappear with a minor change in code like adding printf or commenting something.
Further steps to understand what happens may be running with OPENMM_SAVE_TEMPS=pathwheresavetempsto
on Linux and Windows and comparing corresponding .s files. If you are going to do this, please upload them, I'm also curious.
As a workaround I have a crazy idea to "patch" a source code before compiling in HipContext::createModule
: find for (k = 0; k < rank; k++) {
in a string and add #pragma unroll 1\n
before it. It sounds a bit stupid but it does not require modifying main sources and hence depending on a very recent version of OpenMM.
@jdmaia Have you heard about similar issues on HIP SDK for Windows?
Deja vu. This swapping code reminds me one compiler bug. We had an issue with one test in rocPRIM: https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/develop/rocprim/include/rocprim/warp/detail/warp_sort_stable.hpp#L105
As you can see, the code is very similar. So I found the exact place in a compiled code where the compiler missed one instructions (or more precisely: incorrectly removed one instruction). Then AMD's compiler developer fixed this bug: https://reviews.llvm.org/rGdf1782c2a2af9938ba4c5bacfab20d1ddebc82dd
I wonder if this is indeed the same compiler bug and if the compiler from HIP SDK for Windows includes the fix. @bdenhollander Could you check clang's version (with a commit hash) with hipconfig?
I generated .s files with and without #pragma unroll 1
on Windows. Diffing ignoring whitespace and numbers shows differences in lines 3561-4090.
unroll-windows-hip-amdgcn-amd-amdhsa-gfx1032.s.txt hang-windows-hip-amdgcn-amd-amdhsa-gfx1032.s.txt
hipcc.bin.exe prints clang version 17.0.0 ([email protected]:Compute-Mirrors/llvm-project e3201662d21c48894f2156d302276eb1cf47c7be)
when it launches.
Retested this with Windows HIP SDK 5.7.1 and #pragma unroll 1
is still needed. The clang hash reported by hipcc.bin.exe has changed since 5.5.1.
C:\AMD\ROCm\5.7\bin\hipcc.bin.exe --version
HIP version: 5.7.32000-193a0b56e
clang version 17.0.0 ([email protected]:Compute-Mirrors/llvm-project 6e709f613348e5258188527d11ee8d78376f26b7)
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\AMD\ROCm\5.7\bin
Retested with Windows HIP SDK 6.1.2 and the compiler issue is resolved. #pragma unroll 1
is no longer required.
C:\AMD\ROCm\6.1\bin>hipcc.bin.exe --version
HIP version: 6.1.40252-53f3e11ac
clang version 19.0.0git ([email protected]:Compute-Mirrors/llvm-project b3dbdf4f03718d63a3292f784216fddb3e73d521)
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\AMD\ROCm\6.1\bin