
[Issue]: Cannot compile Fortran do concurrent for AMD GPU with new AMDFLANG compiler

Open sumseq opened this issue 1 year ago • 14 comments

Problem Description

I am trying to use the new "AMD Modern Fortran Compiler" described here:
https://github.com/amd/InfinityHub-CI/tree/main/fortran
on my code that uses "do concurrent" for GPU-offload with optional OpenMP Target data movement (for GPUs/compiler that do not support unified memory).
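
To illustrate the pattern (a minimal, hypothetical sketch rather than code from the application itself): the compute loops are plain do concurrent, and unstructured OpenMP target data directives handle data residency on systems without unified memory.

program dc_offload_sketch
  implicit none
  integer, parameter :: n = 1024
  real(8) :: f(n,n)
  integer :: i, j
  f = 1.0d0
  ! Move the array to the device once (a no-op where unified memory is available).
!$omp target enter data map(to: f)
  ! The compute kernel itself is a standard do concurrent loop; the compiler
  ! flag (e.g. -fdo-concurrent-parallel=device) decides where it runs.
  do concurrent (j=1:n, i=1:n)
    f(i,j) = 2.0d0*f(i,j)
  end do
  ! Bring the result back and release the device copy.
!$omp target exit data map(from: f)
  print *, f(1,1)
end program dc_offload_sketch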

The code is "HipFT" located publicly here:
github.com/predsci/hipft

The code works on NVIDIA GPUs with nvfortran and HPE, and on Intel GPUs with ifx.
It also compiles and runs on AMD server GPUs with HPE's CCE compiler (see https://arxiv.org/pdf/2408.07843)

I have compiled HDF5 1.14.3 (with a configure fix) and OpenMPI 5.0.6 with the amdflang and amdclang compilers to link with the code.

When I try to compile with:
-O3 -fopenmp -fdo-concurrent-parallel=device --offload-arch=gfx906
I get:

LLVM ERROR: aborting  
make: *** [Makefile:25: hipft.o] Error 1

I am using 'mpif90' to compile the code, which uses amdflang underneath:

$ mpif90 -show
amdflang -I/opt/psi/amd/ext_deps/deps/openmpi-5.0.6/include -I/opt/psi/amd/ext_deps/deps/openmpi-5.0.6/lib -L/opt/psi/amd/ext_deps/deps/openmpi-5.0.6/lib -Wl,-rpath -Wl,/opt/psi/amd/ext_deps/deps/openmpi-5.0.6/lib -Wl,--enable-new-dtags -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi

If I try to compile without any OpenMP or Do Concurrent flags, the code compiles fine and runs correctly on 1 CPU core.

If I try to compile with just OpenMP turned on and "do concurrent" set to host, I get a lot of serialization warnings such as: warning: loc("/home/caplanr/hipft/git_amd/src/hipft.f90":7683:7): Some do concurrent loops are not perfectly-nested. These will be serialzied. These concern me, since if I cannot use DC loop nests with index ranges like "2:N-1", then I doubt the code will parallelize at all on either the GPU or the CPU, because a LOT of the loops are like that.
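
For reference, a minimal sketch (not HipFT code, just illustrative) of the kind of loop nest I mean: interior-point index ranges like 2:n-1, with a boundary assignment between the outer and inner do concurrent constructs:

program dc_nesting_sketch
  implicit none
  integer, parameter :: n = 64, nr = 4
  real(8) :: a(n,n,nr), b(n,n,nr)
  integer :: i, j, k
  b = 1.0d0
  a = 0.0d0
  do concurrent (i=1:nr)
    ! Boundary handling between the two loop headers makes the nest
    ! "not perfectly nested".
    a(1,1,i) = b(1,1,i)
    ! Interior points only, i.e. index ranges like 2:n-1.
    do concurrent (k=2:n-1, j=2:n-1)
      a(j,k,i) = b(j,k,i)
    end do
  end do
  print *, sum(a)
end program dc_nesting_sketch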

Note I also had to use -L/opt/amdfort/llvm/lib -lomptarget in this case, otherwise it cannot find the OpenMP target data movement symbols (even though they should not be used in this case...).

Any help would be appreciated as I plan to present the code at SIAM's CSE meeting in a few months and would really like to have some AMD results.

-- Ron

Operating System

Rocky Linux 9.5 (Blue Onyx)

CPU

Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz

GPU

AMD Radeon VII, gfx906, amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-, amdgcn-amd-amdhsa--gfx9-generic:sramecc+:xnack-

ROCm Version

ROCm 6.2.3

ROCm Component

flang

Steps to Reproduce

My ROCm is actually 6.2.4, but that is not on the list. My Linux kernel is: edge 6.10.6-1.el9.elrepo.x86_64

To reproduce, install the new AMD flang compiler from: https://github.com/amd/InfinityHub-CI/tree/main/fortran

Next, clone the repo:

git clone https://github.com/predsci/hipft

Then, copy one of the build scripts from the build_examples folder and edit the top portion to resemble this:

FC="mpif90"
HDF5_INCLUDE_DIR="${PS_EXT_DEPS_HOME}/hdf5/include"
HDF5_LIB_DIR="${PS_EXT_DEPS_HOME}/hdf5/lib"
HDF5_LIB_FLAGS="-lhdf5_fortran -lhdf5hl_fortran -lhdf5 -lhdf5_hl"
FFLAGS="-O3 -fopenmp --offload-arch=gfx906 -fdo-concurrent-parallel=device"

But:

  • Replace the HDF5 paths with those of an HDF5 library compiled with amdflang.
  • The mpif90 should also be associated with an MPI library compiled with amdflang.
  • Replace the gfx906 with the correct GPU arch you are using.

Now, try to run the build script in the top level directory of the repo.

You should see:

./build_amd_gpu.sh
=== STARTING HIPFT BUILD ===
==> Entering src directory...
==> Removing old Makefile...
==> Generating Makefile from Makefile.template...
==> Compiling code...
!!> ERROR!  hipft executable not found.  Build most likely failed.
            Contents of src/build.err:
LLVM ERROR: aborting
make: *** [Makefile:25: hipft.o] Error 1

You can go into the src folder and try to edit the Makefile and recompile as needed.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

$ /opt/rocm/bin/rocminfo --support
ROCk module is loaded

HSA System Attributes

Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Uuid: CPU-XX
Marketing Name: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4000
BDFID: 0
Internal Node ID: 0
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx906
Uuid: GPU-b86490a172da5ee9
Marketing Name: AMD Radeon VII
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 26287(0x66af)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1801
BDFID: 3584
Internal Node ID: 1
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 472
SDMA engine uCode:: 145
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx9-generic:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

Here are my installed amd and rocm packages:

$ sudo dnf list --installed | grep amd
50:amd-smi-lib.x86_64                                     24.6.3.60204-139.el9                @rocm                      
51:amdgpu-core.noarch                                     1:6.2.60204-2070768.el9             @amdgpu                    
52:amdgpu-dkms.noarch                                     1:6.8.5.60204-2070768.el9           @amdgpu                    
53:amdgpu-dkms-firmware.noarch                            1:6.8.5.60204-2070768.el9           @amdgpu                    
54:amdgpu-install.noarch                                  6.2.60204-2070768.el9               @@commandline              
280:dkms.noarch                                            3.1.0-2.el9                         @amdgpu                    
569:hip-runtime-amd.x86_64                                 6.2.41134.60204-139.el9             @rocm                      
592:hsa-amd-aqlprofile.x86_64                              1.0.0.60204.60204-139.el9           @rocm                      
990:libdrm-amdgpu.x86_64                                   1:2.4.120.60204-2070768.el9         @amdgpu                    
991:libdrm-amdgpu-common.noarch                            1.0.0.60204-2070768.el9             @amdgpu                    
992:libdrm-amdgpu-devel.x86_64                             1:2.4.120.60204-2070768.el9         @amdgpu                    
1458:mesa-amdgpu-dri-drivers.x86_64                         1:24.2.0.60204-2070768.el9          @amdgpu                    
1459:mesa-amdgpu-filesystem.x86_64                          1:24.2.0.60204-2070768.el9          @amdgpu                    
1460:mesa-amdgpu-libGL.x86_64                               1:24.2.0.60204-2070768.el9          @amdgpu                    
1461:mesa-amdgpu-va-drivers.x86_64                          1:24.2.0.60204-2070768.el9          @amdgpu                    
2230:teamd.x86_64                                           1.31-16.el9_1                       @baseos          
$ sudo dnf list --installed | grep rocm
50:amd-smi-lib.x86_64                                     24.6.3.60204-139.el9                @rocm                      
178:comgr.x86_64                                           2.8.0.60204-139.el9                 @rocm                      
179:composablekernel-devel.x86_64                          1.1.0.60204-139.el9                 @rocm                      
558:half.x86_64                                            1.12.0.60204-139.el9                @rocm                      
567:hip-devel.x86_64                                       6.2.41134.60204-139.el9             @rocm                      
568:hip-doc.x86_64                                         6.2.41134.60204-139.el9             @rocm                      
569:hip-runtime-amd.x86_64                                 6.2.41134.60204-139.el9             @rocm                      
570:hip-samples.x86_64                                     6.2.41134.60204-139.el9             @rocm                      
571:hipblas.x86_64                                         2.2.0.60204-139.el9                 @rocm                      
572:hipblas-devel.x86_64                                   2.2.0.60204-139.el9                 @rocm                      
573:hipblaslt.x86_64                                       0.8.0.60204-139.el9                 @rocm                      
574:hipblaslt-devel.x86_64                                 0.8.0.60204-139.el9                 @rocm                      
575:hipcc.x86_64                                           1.1.1.60204-139.el9                 @rocm                      
576:hipcub-devel.x86_64                                    3.2.1.60204-139.el9                 @rocm                      
577:hipfft.x86_64                                          1.0.16.60204-139.el9                @rocm                      
578:hipfft-devel.x86_64                                    1.0.16.60204-139.el9                @rocm                      
579:hipfort-devel.x86_64                                   0.4.0.60204-139.el9                 @rocm                      
580:hipify-clang.x86_64                                    18.0.0.60204-139.el9                @rocm                      
581:hiprand.x86_64                                         2.11.1.60204-139.el9                @rocm                      
582:hiprand-devel.x86_64                                   2.11.1.60204-139.el9                @rocm                      
583:hipsolver.x86_64                                       2.2.0.60204-139.el9                 @rocm                      
584:hipsolver-devel.x86_64                                 2.2.0.60204-139.el9                 @rocm                      
585:hipsparse.x86_64                                       3.1.1.60204-139.el9                 @rocm                      
586:hipsparse-devel.x86_64                                 3.1.1.60204-139.el9                 @rocm                      
587:hipsparselt.x86_64                                     0.2.1.60204-139.el9                 @rocm                      
588:hipsparselt-devel.x86_64                               0.2.1.60204-139.el9                 @rocm                      
589:hiptensor.x86_64                                       1.3.0.60204-139.el9                 @rocm                      
590:hiptensor-devel.x86_64                                 1.3.0.60204-139.el9                 @rocm                      
592:hsa-amd-aqlprofile.x86_64                              1.0.0.60204.60204-139.el9           @rocm                      
593:hsa-rocr.x86_64                                        1.14.0.60204-139.el9                @rocm                      
594:hsa-rocr-devel.x86_64                                  1.14.0.60204-139.el9                @rocm                      
595:hsakmt-roct-devel.x86_64                               20240607.5.7.60204-139.el9          @rocm                      
1477:migraphx.x86_64                                        2.10.0.60204-139.el9                @rocm                      
1478:migraphx-devel.x86_64                                  2.10.0.60204-139.el9                @rocm                      
1480:miopen-hip.x86_64                                      3.2.0.60204-139.el9                 @rocm                      
1481:miopen-hip-devel.x86_64                                3.2.0.60204-139.el9                 @rocm                      
1482:mivisionx.x86_64                                       3.0.0.60204-139                     @rocm                      
1483:mivisionx-devel.x86_64                                 3.0.0.60204-139                     @rocm                      
1574:openmp-extras-devel.x86_64                             18.62.0.60204-139.el9               @rocm                      
1575:openmp-extras-runtime.x86_64                           18.62.0.60204-139.el9               @rocm                      
2048:rccl.x86_64                                            2.20.5.60204-139.el9                @rocm                      
2049:rccl-devel.x86_64                                      2.20.5.60204-139.el9                @rocm                      
2058:rocalution.x86_64                                      3.2.1.60204-139.el9                 @rocm                      
2059:rocalution-devel.x86_64                                3.2.1.60204-139.el9                 @rocm                      
2060:rocblas.x86_64                                         4.2.4.60204-139.el9                 @rocm                      
2061:rocblas-devel.x86_64                                   4.2.4.60204-139.el9                 @rocm                      
2062:rocdecode.x86_64                                       0.6.0.60204-139                     @rocm                      
2063:rocdecode-devel.x86_64                                 0.6.0.60204-139                     @rocm                      
2064:rocfft.x86_64                                          1.0.30.60204-139.el9                @rocm                      
2065:rocfft-devel.x86_64                                    1.0.30.60204-139.el9                @rocm                      
2073:rocm.x86_64                                            6.2.4.60204-139.el9                 @rocm                      
2074:rocm-cmake.x86_64                                      0.13.0.60204-139.el9                @rocm                      
2075:rocm-core.x86_64                                       6.2.4.60204-139.el9                 @rocm                      
2076:rocm-dbgapi.x86_64                                     0.76.0.60204-139.el9                @rocm                      
2077:rocm-debug-agent.x86_64                                2.0.3.60204-139.el9                 @rocm                      
2078:rocm-developer-tools.x86_64                            6.2.4.60204-139.el9                 @rocm                      
2079:rocm-device-libs.x86_64                                1.0.0.60204-139.el9                 @rocm                      
2080:rocm-gdb.x86_64                                        14.2.60204-139.el9                  @rocm                      
2081:rocm-hip-libraries.x86_64                              6.2.4.60204-139.el9                 @rocm                      
2082:rocm-hip-runtime.x86_64                                6.2.4.60204-139.el9                 @rocm                      
2083:rocm-hip-runtime-devel.x86_64                          6.2.4.60204-139.el9                 @rocm                      
2084:rocm-hip-sdk.x86_64                                    6.2.4.60204-139.el9                 @rocm                      
2085:rocm-language-runtime.x86_64                           6.2.4.60204-139.el9                 @rocm                      
2086:rocm-llvm.x86_64                                       18.0.0.24392.60204-139.el9          @rocm                      
2087:rocm-ml-libraries.x86_64                               6.2.4.60204-139.el9                 @rocm                      
2088:rocm-ml-sdk.x86_64                                     6.2.4.60204-139.el9                 @rocm                      
2089:rocm-opencl.x86_64                                     2.0.0.60204-139.el9                 @rocm                      
2090:rocm-opencl-devel.x86_64                               2.0.0.60204-139.el9                 @rocm                      
2091:rocm-opencl-icd-loader.x86_64                          1.2.60204-139.el9                   @rocm                      
2092:rocm-opencl-runtime.x86_64                             6.2.4.60204-139.el9                 @rocm                      
2093:rocm-opencl-sdk.x86_64                                 6.2.4.60204-139.el9                 @rocm                      
2094:rocm-openmp-sdk.x86_64                                 6.2.4.60204-139.el9                 @rocm                      
2095:rocm-smi-lib.x86_64                                    7.3.0.60204-139.el9                 @rocm                      
2096:rocm-utils.x86_64                                      6.2.4.60204-139.el9                 @rocm                      
2097:rocminfo.x86_64                                        1.0.0.60204-139.el9                 @rocm                      
2098:rocprim-devel.x86_64                                   3.2.2.60204-139.el9                 @rocm                      
2099:rocprofiler.x86_64                                     2.0.60204.60204-139.el9             @rocm                      
2100:rocprofiler-devel.x86_64                               2.0.60204.60204-139.el9             @rocm                      
2101:rocprofiler-plugins.x86_64                             2.0.60204.60204-139.el9             @rocm                      
2102:rocprofiler-register.x86_64                            0.4.0.60204-139.el9                 @rocm                      
2103:rocprofiler-sdk.x86_64                                 0.4.0-139.el9                       @rocm                      
2104:rocprofiler-sdk-roctx.x86_64                           0.4.0-139.el9                       @rocm                      
2105:rocrand.x86_64                                         3.1.1.60204-139.el9                 @rocm                      
2106:rocrand-devel.x86_64                                   3.1.1.60204-139.el9                 @rocm                      
2107:rocsolver.x86_64                                       3.26.2.60204-139.el9                @rocm                      
2108:rocsolver-devel.x86_64                                 3.26.2.60204-139.el9                @rocm                      
2109:rocsparse.x86_64                                       3.2.1.60204-139.el9                 @rocm                      
2110:rocsparse-devel.x86_64                                 3.2.1.60204-139.el9                 @rocm                      
2111:rocthrust-devel.x86_64                                 3.1.1.60204-139.el9                 @rocm                      
2112:roctracer.x86_64                                       4.1.60204.60204-139.el9             @rocm                      
2113:roctracer-devel.x86_64                                 4.1.60204.60204-139.el9             @rocm                      
2114:rocwmma-devel.x86_64                                   1.5.0.60204-139.el9                 @rocm                      
2127:rpp.x86_64                                             1.8.0.60204-139.el9                 @rocm                      
2128:rpp-devel.x86_64                                       1.8.0.60204-139.el9                 @rocm        

sumseq avatar Nov 27 '24 00:11 sumseq

Hi @sumseq. An internal ticket has been created to assist with your issue. Thanks!

ppanchad-amd avatar Dec 23 '24 16:12 ppanchad-amd

Hi @sumseq, can you try building again with the latest version of the new flang (available at https://repo.radeon.com/rocm/misc/flang/)? I was able to build successfully with the -fdo-concurrent-parallel=host flag using drop 5.1.0. I have included the steps I followed below:

Getting the new flang

  • Download the latest flang tarball for your distro from https://repo.radeon.com/rocm/misc/flang/ and untar it
  • add <path to install>/bin to your PATH and <path to install>/lib to LD_LIBRARY_PATH.

Building MPI:

  • Get the source code from https://www-lb.open-mpi.org/software/ (I used version 5.0.6) and untar it

  • Then cd to the directory containing the source code and run the following configure command: ./configure --prefix=<path-to-mpi-code-dir>/install F77=<path to install>/bin/amdflang FC=<path to install>/bin/amdflang

  • Then do make all install

  • Reference: https://www.open-mpi.org/faq/?category=building, https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/compilers-and-flags.html#

Building HDF5:

  • Get the source code from https://www.hdfgroup.org/download-hdf5/source-code/ (I used version 1.14.5) and untar it
  • Then cd to the directory containing the source code and run the following configure command: ./configure --enable-fortran --enable-shared --enable-static --enable-optimization=high --disable-tests --enable-parallel --prefix=<path-to-hdf5-code-dir>/install F77="<path-to-mpi-code-dir>/install/bin/mpif77" FC="<path-to-mpi-code-dir>/install/bin/mpif90"
  • libtool is broken for the new flang so run this after the configure: sed -i 's/wl=""/wl="-Wl,"/g;s/pic_flag=""/pic_flag=" -fPIC -DPIC"/g' libtool
  • Then do make, make check, and make install
  • Reference: https://github.com/HDFGroup/hdf5/blob/develop/release_docs/INSTALL_Autotools.txt, https://github.com/HDFGroup/hdf5/blob/develop/release_docs/INSTALL_parallel

Building hipft

Same as described in the original issue:

git clone https://github.com/predsci/hipft

Then, copy one of the build scripts from the build_examples folder and edit the top portion to resemble this:

FC="mpif90"
HDF5_INCLUDE_DIR="${PS_EXT_DEPS_HOME}/hdf5/include"
HDF5_LIB_DIR="${PS_EXT_DEPS_HOME}/hdf5/lib"
HDF5_LIB_FLAGS="-lhdf5_fortran -lhdf5hl_fortran -lhdf5 -lhdf5_hl"
FFLAGS="-O3 -fopenmp --offload-arch=gfx906 -fdo-concurrent-parallel=device"

but with

FC="<path-to-mpi-code-dir>/install/bin/mpif90"
HDF5_INCLUDE_DIR="<path-to-hdf5-code-dir>/install/include"
HDF5_LIB_DIR="<path-to-hdf5-code-dir>/install/lib"
HDF5_LIB_FLAGS="-lhdf5_fortran -lhdf5hl_fortran -lhdf5 -lhdf5_hl"
FFLAGS="-O3 -fopenmp --offload-arch=gfx906 -fdo-concurrent-parallel=host"

sohaibnd avatar Jan 17 '25 23:01 sohaibnd

Hi,

I was not able to follow your MPI and HDF5 build instructions exactly, but I was able to install them with the new flang using our external library builder.

I was able to compile hipft with those flags but see:

EDGE_CPU: ~/hipft/git_amd $ more src/build.err 
warning: loc("/home/caplanr/hipft/git_amd/src/hipft.f90":7479:7): Some `do concurent` loops are not perfectly-nested. These will be serialzied.
warning: loc("/home/caplanr/hipft/git_amd/src/hipft.f90":7695:7): Some `do concurent` loops are not perfectly-nested. These will be serialzied.
warning: loc("/home/caplanr/hipft/git_amd/src/hipft.f90":7762:7): Some `do concurent` loops are not perfectly-nested. These will be serialzied.
flang-new: warning: argument unused during compilation: '-fdo-concurrent-parallel=host' [-Wunused-command-line-argument]

The warnings indicate a performance issue, but the last message seems to say it is ignoring the do concurrent flag (even though it seems to have tried to parallelize the loops anyway...).

I tried to run the code but got:

hipft: error while loading shared libraries: libffi.so.6: cannot open shared object file: No such file or directory

This is an issue I encountered before, and it seems to be due to Rocky 9.5 using a newer version of libffi. I was able to get around this by including the library path from my conda Python environment, which has the needed version.

The code now runs, but when I run the testsuite, one of the tests failed with a small error:

FAIL in FLUX_POSITIVE:
        4.569453187963095E+020
        4.570258106521247E+020

But another test failed much worse:

FAIL in FLUX_POSITIVE:
        1.069313097596278E+022
        0.000000000000000E+000

The first test of the testsuite did pass, however, so I am thinking there must be a specific part of the code that is the problem, but I am not sure how to identify it.

Just to test, I removed the GPU architecture flag, and then I can compile but when running I get:

[edge:333724] *** Process received signal ***
[edge:333724] Signal: Segmentation fault (11)
[edge:333724] Signal code: Address not mapped (1)
[edge:333724] Failing at address: 0x250
[edge:333724] [ 0] /lib64/libc.so.6(+0x3e730)[0x7ffb9263e730]
[edge:333724] [ 1] /lib64/libc.so.6(pthread_mutex_lock+0x4)[0x7ffb9268d004]
[edge:333724] [ 2] /home/opt/amd/amdflang/lib/llvm/bin/../lib/libomptarget.so.20.0git(+0x1d8782f)[0x7ffb94b8782f]
[edge:333724] [ 3] /home/opt/amd/amdflang/lib/llvm/bin/../lib/libomptarget.so.20.0git(+0x1d81532)[0x7ffb94b81532]
[edge:333724] [ 4] /home/opt/amd/amdflang/lib/llvm/bin/../lib/libomptarget.so.20.0git(+0x1d81acc)[0x7ffb94b81acc]
[edge:333724] [ 5] /home/opt/amd/amdflang/lib/llvm/bin/../lib/libomptarget.so.20.0git(__tgt_target_data_begin_mapper+0x20b)[0x7ffb94b8187b]
[edge:333724] [ 6] hipft[0x2604b4]
[edge:333724] [ 7] hipft[0x23e281]
[edge:333724] [ 8] hipft[0x2357e8]
[edge:333724] [ 9] hipft[0x234903]
[edge:333724] [10] hipft[0x2764ad]
[edge:333724] [11] /lib64/libc.so.6(+0x295d0)[0x7ffb926295d0]
[edge:333724] [12] /lib64/libc.so.6(__libc_start_main+0x80)[0x7ffb92629680]
[edge:333724] [13] hipft[0x234845]
[edge:333724] *** End of error message ***

I also tried to compile with the original flags but with "device" instead of "host" and got:

PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
flang-new: error: unable to execute command: Segmentation fault (core dumped)
flang-new: error: flang frontend command failed due to signal (use -v to see invocation)
AMD AFAR drop #5.1 12/06/24 flang-new version 20.0.0git (ssh://gerritgit/lightning/ec/llvm-project  24492 7230547be29cf698d5a025bff4bb89da31bdb0c8)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/opt/amd/amdflang/lib/llvm/bin
Build config: +assertions
Configuration file: /home/opt/amd/amdflang/lib/llvm/bin/flang.cfg
flang-new: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
flang-new: note: diagnostic msg: /tmp/hipft-gfx906-0c6212
flang-new: note: diagnostic msg: /tmp/hipft-caad58
flang-new: note: diagnostic msg: /tmp/hipft-gfx906-0c6212.sh
flang-new: note: diagnostic msg: 

sumseq avatar Jan 21 '25 01:01 sumseq

UPDATE (and some progress!):

I have installed the latest version of ROCm I can (6.3.2) as well as the latest amdflang compiler: rocm-afar-7110-drop-5.3.0-rhel.tar.bz2. I have also updated the drivers using the repo: baseurl=https://repo.radeon.com/amdgpu/6.3.2/rhel/$amdgpudistro/proprietary/x86_64

I encountered two compiler issues:

  1. In my nested DC loops (3 total) that have a reduction in the inner loop (an array reduction overall) I got: error: loc("/home/caplanr/hipft/git_amd/src/hipft.f90":7689:9): 'fir.do_loop' op using value defined outside the region. To work around this, I changed the inner DC loop to a standard do loop (which will slow things down, but getting it working is the first priority; a small sketch of this change is shown after this list).

  2. I got this linking error:

ld.lld: error: undefined symbol: _FortranAModuloReal8
>>> referenced by /tmp/hipft.amdgcn.gfx906-6dafcd.img.lto.o:(__omp_offloading_fd02_219bc6e1_analysis_step__l3162)

I had a call to the intrinsic MODULO inside an accelerated DC loop.
I assume that this intrinsic is simply not yet supported within accelerated regions.
Luckily, I was able to move it outside the DC loop, as it was not needed inside in the first place (it was computing a loop-index-independent scalar).
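
Here is a minimal sketch of the workaround in item 1 (illustrative only; the actual diff against HipFT is posted further down in this thread):

program dc_reduce_workaround
  implicit none
  integer, parameter :: nr = 8, npm = 128
  real(8) :: flux(npm,nr), dp(npm), aop(nr)
  real(8) :: fn
  integer :: i, k
  flux = 1.0d0
  dp = 0.5d0
  do concurrent (i=1:nr)
    fn = 0.0d0
    ! The original inner loop was:
    !   do concurrent (k=2:npm-1) reduce(+:fn)
    ! which triggered: 'fir.do_loop' op using value defined outside the region.
    ! Workaround: run the inner reduction as an ordinary sequential do loop.
    do k = 2, npm-1
      fn = fn + flux(k,i)*dp(k)
    end do
    aop(i) = fn
  end do
  print *, aop(1)
end program dc_reduce_workaround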

With the two workarounds above, I was able to compile the code with:

-O3 -fopenmp -fdo-concurrent-parallel=device --offload-arch=gfx906

with the only build error/warning being: flang-new: warning: argument unused during compilation: '-fdo-concurrent-parallel=device' [-Wunused-command-line-argument] (which is odd, since things behave differently without that flag, and I am not sure how it would otherwise know to offload to the GPU).

The problem now is that when I try to run the code, it just hangs with no output, running the CPU at 100% but the GPU at 0%:

4183 caplanr 20 0 227652 16308 13748 R 99.7 0.1 14:47.80 hipft

rocm-smi
=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)                                                     
========================================================================================================================
0       1     0x66af,   59266  32.0°C  18.0W     N/A, N/A, 0         808Mhz  350Mhz  19.61%  auto  250.0W  0%     0%    
========================================================================================================================
amd-smi process
GPU: 0
    PROCESS_INFO: No running processes detected

Could this be because my RADEON VII is no longer supported by the newest ROCm?

Are you able to get the code running on your AMD GPUs?

-- Ron

sumseq avatar Feb 18 '25 00:02 sumseq

@sumseq Thanks for continuing to look into this. Can you provide the modified hipft code? I can try it on my end with a supported GPU (note that support for do concurrent loops is still a work in progress, though, so it may not be the GPU). Also, are your tests still failing with -fdo-concurrent-parallel=host on the latest version of the new flang compiler?

sohaibnd avatar Feb 18 '25 16:02 sohaibnd

Hi,

I can compile with -fdo-concurrent-parallel=host but when I try to run the code it fails with:

!!!> hipft.err contents: 

fatal Fortran runtime error(/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/llvm-project/flang/runtime/descriptor.cpp:36): Descriptor::Establish: CFI_establish returned 15 for CFI_type_t(-1)
[edge:05529] *** Process received signal ***
[edge:05529] Signal: Aborted (6)
[edge:05529] Signal code:  (-6)
[edge:05529] [ 0] /lib64/libc.so.6(+0x3e730)[0x7f9a2c63e730]
[edge:05529] [ 1] /lib64/libc.so.6(+0x8ba6c)[0x7f9a2c68ba6c]
[edge:05529] [ 2] /lib64/libc.so.6(raise+0x16)[0x7f9a2c63e686]
[edge:05529] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7f9a2c628833]
[edge:05529] [ 4] /opt/psi/amd/ext_deps/deps/hdf5/lib/libhdf5_fortran.so.310(+0x1a3855)[0x7f9a319a3855]
[edge:05529] [ 5] hipft(_ZN7Fortran7runtime10Descriptor9EstablishERKNS0_8typeInfo11DerivedTypeEPviPKlh+0x1fe)[0x28963e]
[edge:05529] [ 6] hipft(_ZN7Fortran7runtime10InitializeERKNS0_10DescriptorERKNS0_8typeInfo11DerivedTypeERNS0_10TerminatorEbPS2_+0x1596)[0x287d66]
[edge:05529] [ 7] hipft[0x28327c]
[edge:05529] [ 8] hipft[0x26a060]
[edge:05529] [ 9] hipft[0x2480bd]
[edge:05529] [10] hipft[0x241808]
[edge:05529] [11] hipft[0x240923]
[edge:05529] [12] hipft[0x28284d]
[edge:05529] [13] /lib64/libc.so.6(+0x295d0)[0x7f9a2c6295d0]
[edge:05529] [14] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f9a2c629680]
[edge:05529] [15] hipft[0x240865]
[edge:05529] *** End of error message ***

For the modified code, just check out the latest version of HipFT and make these few modifications:

diff --git a/src/hipft.f90 b/src/hipft.f90
index bb38ac7..f24ee3b 100644
--- a/src/hipft.f90
+++ b/src/hipft.f90
@@ -3159,13 +3159,14 @@ subroutine analysis_step
                             diffusion_coef_factor*(time-time_start))
         enddo
       elseif (validation_run .eq. 2) then
-        do concurrent (k=1:nr,j=1:ntm,i=1:npm-1)
-!
-          vtt =       flow_vt_const*km_s_to_rs_hr*(time-time_start)
-          vpt = flow_vp_rigid_omega*km_s_to_rs_hr*(time-time_start)
+!              
+        vtt =       flow_vt_const*km_s_to_rs_hr*(time-time_start)
+        vpt = flow_vp_rigid_omega*km_s_to_rs_hr*(time-time_start)
 !
-          p1 = MODULO(pi_two+vpt,      twopi)
-          p2 = MODULO(threepi_two+vpt, twopi)
+        p1 = MODULO(pi_two+vpt,      twopi)
+        p2 = MODULO(threepi_two+vpt, twopi)
+!        
+        do concurrent (k=1:nr,j=1:ntm,i=1:npm-1)
 !
           fval(i,j,k) = -st_i(j)*EXP(-(t(j) - pi_four - vtt)**2/validation_run_width - &
                                                    (p(i)-p1)**2/validation_run_width)  &
@@ -7686,13 +7687,13 @@ subroutine advection_operator_upwind (ftemp,aop)
       do concurrent (i=1:nr)
         fn = zero
         fs = zero
-        do concurrent(k=2:npm-1) reduce(+:fn,fs)
+        do k=2,npm-1
           fn = fn + flux_t(   2,k,i)*dp(k)
           fs = fs + flux_t(ntm1,k,i)*dp(k)
         enddo
 ! ****** Note that the south pole needs a sign change since the
 ! ****** theta flux direction is reversed.
-        do concurrent(k=2:npm-1)
+        do k=2,npm-1
           aop(  1,k,i) =  fn*bc_flow_npole_fac
           aop(ntm,k,i) = -fs*bc_flow_spole_fac
         enddo
@@ -7902,13 +7903,13 @@ subroutine advection_operator_weno3 (ftemp,aop)
       do concurrent (i=1:nr)
         fn = zero
         fs = zero
-        do concurrent(k=2:npm-1) reduce(+:fn,fs)
+        do k=2,npm-1
           fn = fn + flux_t(   2,k,i)*dp(k)
           fs = fs + flux_t(nt-1,k,i)*dp(k)
         enddo
 ! ****** Note that the south pole needs a sign change since the
 ! ****** theta flux direction is reversed.
-        do concurrent(k=2:npm-1)
+        do k=2,npm-1
           aop(  1,k,i) =  fn*bc_flow_npole_fac
           aop(ntm,k,i) = -fs*bc_flow_spole_fac
         enddo
@@ -7969,7 +7970,7 @@ subroutine diffusion_operator_cd (x,y)
       do concurrent(i=1:nr)
         fn = zero
         fs = zero
-        do concurrent(k=2:npm-1) reduce(+:fn,fs)
+        do k=2,npm-1
           fn = fn + (diffusion_coef(1    ,k,i)        &
                    + diffusion_coef(2    ,k,i))       &
                   * (x(2  ,k,i) - x(1    ,k,i))*dp(k)
@@ -7977,7 +7978,7 @@ subroutine diffusion_operator_cd (x,y)
                    + diffusion_coef(nt   ,k,i))       &
                   * (x(ntm,k,i) - x(ntm-1,k,i))*dp(k)
         enddo
-        do concurrent(k=1:npm)
+        do k=1,npm
           y(  1,k,i) =  fn*bc_diffusion_npole_fac
           y(ntm,k,i) = -fs*bc_diffusion_spole_fac
         enddo

-- Ron

sumseq avatar Feb 18 '25 21:02 sumseq

BTW instead of hanging, now the code produces this error when trying to run on the GPU:

"PluginInterface" error: Failure to run target region 0x000000003a3c9550 in device 0: Mismatch of kernel arguments size
omptarget error: Executing target region abort target.
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: Source location information not present. Compile with -g or -gline-tables-only.
omptarget fatal error 1: failure of target construct while offloading is mandatory
[edge:87904] *** Process received signal ***
[edge:87904] Signal: Aborted (6)
[edge:87904] Signal code:  (-6)
[edge:87904] [ 0] /lib64/libc.so.6(+0x3e730)[0x7f917603e730]
[edge:87904] [ 1] /lib64/libc.so.6(+0x8ba6c)[0x7f917608ba6c]
[edge:87904] [ 2] /lib64/libc.so.6(raise+0x16)[0x7f917603e686]
[edge:87904] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7f9176028833]
[edge:87904] [ 4] /home/opt/amd/amdflang/lib/llvm/bin/../lib/libomptarget.so.20.0git(+0x1d84b05)[0x7f9178584b05]
[edge:87904] [ 5] /home/opt/amd/amdflang/lib/llvm/bin/../lib/libomptarget.so.20.0git(__tgt_target_kernel+0x16d7)[0x7f917857f977]
[edge:87904] [ 6] hipft[0x36d7cc]
[edge:87904] [ 7] hipft[0x35d246]
[edge:87904] [ 8] hipft[0x35bcb0]
[edge:87904] [ 9] hipft[0x3d37f2]
[edge:87904] [10] /lib64/libc.so.6(+0x295d0)[0x7f91760295d0]
[edge:87904] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f9176029680]
[edge:87904] [12] hipft[0x35bbf5]
[edge:87904] *** End of error message ***

-- Ron

sumseq avatar Feb 18 '25 21:02 sumseq

Hi,

Another update.

If I remove the -fdo-concurrent-parallel=device compile flag, but keep the offload flag --offload-arch=gfx906, the code compiles and seems to run!

I see the process in nvtop but it shows 100% CPU and almost no GPU usage (although there is indeed some GPU memory being used), and I see it in amd-smi!

EDGE_CPU: ~ $ amd-smi process
GPU: 0
    PROCESS_INFO:
        NAME: hipft
        PID: 88795
        MEMORY_USAGE:
            GTT_MEM: 2.1 MB
            CPU_MEM: 3.4 MB
            VRAM_MEM: 170.1 MB
        MEM_USAGE: 175.6 MB
        USAGE:
            GFX: 0 ns
            ENC: 0 ns

The results of my test suite however show:

===========================================================================
Summary of test results:
===========================================================================
Test name                            PASS/FAIL  Run-time  Ref-time  Speedup
===========================================================================
advect_gaussians_phi                 PASS         13.089     0.621     0.05
advect_gaussians_theta               PASS         13.139     0.605     0.05
advect_gaussians_phi_theta           PASS         13.029     0.597     0.05
diffuse_soccer                       FAIL         40.396     2.403     0.06
diffuse_advect_soccer                FAIL         70.180     3.290     0.05
diffuse_dipole                       PASS        188.376     4.321     0.02
diffuse_advect_atten_map_1cr         PASS        227.003     6.927     0.03
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The timings are not relevant as these are tiny problems, but two of the tests fail. My guess is some problem with the creation of the analytic solution for the "soccer ball" function, which uses a GPU-accelerated routine that computes spherical harmonics.

The timing for the last test is somewhat relevant, and it is very, very slow, but I expected that given the changes that were needed for compilation.

I am interested to see what results you can get on your AMD GPU.

-- Ron

sumseq avatar Feb 18 '25 22:02 sumseq

Hello Ron,

I am working on implementing do concurrent parallelization in flang and am currently focusing on getting this into better shape on AMD GPUs. Thanks for the report! Happy to collaborate with you on getting HipFT automatically parallelized using amdflang. It is a new feature and still has rough edges to overcome.

I compiled Open MPI, HDF5, and HipFT locally and reproduced your results above:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
===========================================================================
Summary of test results:
===========================================================================
Test name                            PASS/FAIL  Run-time  Ref-time  Speedup
===========================================================================
advect_gaussians_phi                 PASS         10.103     0.621     0.06
advect_gaussians_theta               PASS          9.814     0.605     0.06
advect_gaussians_phi_theta           PASS         15.480     0.597     0.04
diffuse_soccer                       FAIL         22.185     2.403     0.11
diffuse_advect_soccer                FAIL         54.109     3.290     0.06
diffuse_dipole                       PASS         70.098     4.321     0.06
diffuse_advect_atten_map_1cr         PASS        211.970     6.927     0.03
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

If I remove the -fdo-concurrent-parallel=device compile flag, but keep the offload flag --offload-arch=gfx906, the code compiles and seems to run! I see the process in nvtop but it shows 100% CPU and almost no GPU usage (although there is indeed some GPU memory being used), and I see it in amd-smi!

The reason you see GPU memory usage is that HipFT transfers data to and from the GPU using !$omp target enter|update|exit directives. The reason you see GPU usage at almost 0% is that all the HipFT kernels are inside do concurrent loops, and removing the -fdo-concurrent-parallel=device flag serializes these loops and runs them on the CPU. So the reported numbers in both of our runs are actually CPU numbers.

... The timing for the last test is somewhat relevant, and it is very very slow ...

This is probably explained by the previous part of my reply as well, but I cannot be 100% sure here.

The timings are not relevant as they are tiny problems, but two of the tests fail.

If these tests succeed on other compilers, then this might be a general bug in amdflang (i.e. not related in particular to do concurrent or GPU offloading). Do they succeed with the other compilers you tested?

flang-new: warning: argument unused during compilation: '-fdo-concurrent-parallel=host' [-Wunused-command-line-argument]

Regarding this warning, it is emitted because the same flag is passed during both compilation steps: the step that produces the object file and the step that produces the actual executable. The compiler only needs (and uses) the flag for the first step; for the second step it does not, hence the warning. I agree this is confusing, and I will look into it after resolving the more important issues.

/home/kaergawy/playground/openmpi-5.0.7/install/bin/mpif90 -c -O3 -fopenmp --offload-arch=gfx90a -fdo-concurrent-to-openmp=device -I/home/kaergawy/playground/hdf5-hdf5_1.14.6/install/include -c hipft.f90 -o hipft.o
warning: Mapping `do concurrent` to OpenMP is still experimental.
warning: Mapping `do concurrent` to OpenMP is still experimental.
/home/kaergawy/playground/openmpi-5.0.7/install/bin/mpif90 -O3 -fopenmp --offload-arch=gfx90a -fdo-concurrent-to-openmp=device -I/home/kaergawy/playground/hdf5-hdf5_1.14.6/install/include hipft.o -L/home/kaergawy/playground/hdf5-hdf5_1.14.6/install/lib -lhdf5_fortran -lhdf5hl_fortran -lhdf5 -lhdf5_hl -o hipft
flang-21: warning: argument unused during compilation: '-fdo-concurrent-to-openmp=device' [-Wunused-command-line-argument]

In any case, I am looking into the GPU and do concurrent issues to hopefully reproduce the successful tests on the GPU as well ... 👀 . At the moment, I am getting the same linker error you reported previously:

ld.lld: error: undefined symbol: _FortranAModuloReal8

And this is what I am looking at right now.

ergawy avatar May 06 '25 11:05 ergawy

Update:

  • In order to resolve the _FortranAModuloReal8 linker issue, you need to link with a Fortran runtime library built for the device. Such a library is built and provided in our beta drops which are available here: https://repo.radeon.com/rocm/misc/flang/. To link with the device Fortran RT library, you can add: -lflang_rt.hostdevice.
  • After linking the device RT library, HipFT compiles and links without errors. However, the runtime results are incorrect (the test suite tests fail). After looking at the source code, this seems most likely to be due to the use of reductions, which are not yet supported; we are working on adding support for them, as well as for locality specifiers, at the moment (a small sketch of these constructs follows this list). There are PRs upstream to start supporting this, and more are to come towards a complete implementation.
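
For reference, a minimal sketch (not HipFT code) of the do concurrent features in question, i.e. the REDUCE clause and locality specifiers:

program dc_features_sketch
  implicit none
  integer, parameter :: n = 1000
  real(8) :: x(n), s, t
  integer :: i
  x = 1.0d0
  s = 0.0d0
  ! local(t) gives each iteration its own copy of t;
  ! reduce(+:s) requests a parallel sum reduction into s.
  do concurrent (i=1:n) local(t) reduce(+:s)
    t = 2.0d0*x(i)
    s = s + t
  end do
  print *, s
end program dc_features_sketch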

I will keep you posted.

ergawy avatar May 06 '25 17:05 ergawy

Hi,

Thanks for looking into this! My machine with the AMD GPU is not up right now (had to replace a failing system with it) but once I get it back up (within a few weeks) I will give it a try again.

As for the reductions, we have a branch on the HipFT GitHub that uses OpenMP reductions instead of DC, if that helps. It is the "waccpd24" branch. If the OpenMP target implementation is more mature than the DC one, that branch may be a useful development tool (a sketch of the kind of substitution it makes is below).
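
For example, where the main branch has a do concurrent reduction, the waccpd24 branch uses an OpenMP target reduction along these lines (a hypothetical sketch, not the actual branch code):

program omp_reduction_sketch
  implicit none
  integer, parameter :: n = 1000
  real(8) :: x(n), s
  integer :: i
  x = 1.0d0
  s = 0.0d0
  ! OpenMP target reduction in place of "do concurrent ... reduce(+:s)".
!$omp target teams distribute parallel do map(to: x) map(tofrom: s) reduction(+:s)
  do i = 1, n
    s = s + x(i)
  end do
  print *, s
end program omp_reduction_sketch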

-- Ron

sumseq avatar May 06 '25 21:05 sumseq

I built using the "waccpd24" branch and the results seem to be more reasonably correct than with the "main" branch using do concurrent.

How accurate/correct is the reference data? The tests still fail, but the delta between the reference and run data is not as drastic as it is for do concurrent. For example:

# using the "waccpd24" branch
$ ./run_test_suite.sh -test=advect_gaussians_phi

==> COMPARING RUN DATA TO REFERENCE DATA
=======================================================
==> Running comparison...

WARNING: PASSED A FAIL (value <1e-64) in BR_MIN:
        -1.378037984328386E+000
        -1.424039803492147E+000
FAIL in BR_MAX:
        1.378628228877954E+000
        1.424039803492146E+000
FAIL in FLUX_POSITIVE:
        4.569448091947724E+020
        4.565513240141229E+020
WARNING: PASSED A FAIL (value <1e-64) in FLUX_NEGATIVE:
        -4.569448091947724E+020
        -4.565513240141230E+020
FAIL in VALIDATION_ERR_HHabs:
        2.894308580544581E-002
        9.885666886289569E-002

==> Test seems to have FAILED!
==> 1 
=======================================================
==> CLEANING RUN DATA
=======================================================
==> Removing files from run data...
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
===========================================================================
Summary of test results:
===========================================================================
Test name                            PASS/FAIL  Run-time  Ref-time  Speedup
===========================================================================
advect_gaussians_phi                 FAIL          9.777     0.621     0.06
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

ergawy avatar May 07 '25 08:05 ergawy

Hi,

The results should be better than that. The testsuite only checks about 5 decimal places, so I get a PASS with nvfortran on both GPU and CPU, as well as with gfortran.

That said, if you are only getting the "PASSED A FAIL" warnings, that should be fine.

I pushed an update to that branch yesterday, right before sending my message, so if you have not yet done a git pull on it, you can try that. There were a couple of reduction loops that were still DC, and a bug in the diffusion algorithm.

-- Ron

sumseq avatar May 07 '25 18:05 sumseq

Hi,

I finally got myself a RADEON RDNA2 GPU and am back to trying to get the code working.

I downloaded the AFAR compiler and ran into two issues:

  1. do concurrent "reduce" is still not supported (as you mentioned above) so I switched to the waccpd branch of HipFT.
  2. It seems like the MODULO() Fortran intrinsic cannot yet be used inside a GPU compute loop? This is used to compute the validation solution. I replaced the modulo() calls with hand-written code (I tested the new code on an NVIDIA GPU and it passed all tests).

I could now compile and run the code on the AMD GPU. nvtop shows the code running on the AMD GPU at 99% utilization. However, the code is running VERY slow. Maybe it is not using my unstructured OpenMP data regions? Or is it defaulting the OpenMP/do concurrent mapping to an always "map tofrom" instead of the default "present or copy" behavior?

I tried using an older (two versions back) AFAR (8473) and the code compiled and ran much faster, but got wrong answers (like earlier in this thread).

I tried AFAR 8705 (one version back) and it compiled fine but also runs super slow and still gets the wrong answer. I see it using the AMD GPU in nvtop.

I am wondering if it is having trouble with this loop (Intel GPUs had issues with it in the past before a compiler update):

!$omp target teams distribute private(fn,fs)
      do i=1,nr
        fn = zero
        fs = zero
!$omp parallel do reduction(+:fn,fs)
        do k=2,npm-1
          fn = fn + flux_t(   2,k,i)*dp(k)
          fs = fs + flux_t(ntm1,k,i)*dp(k)
        enddo
!$omp parallel do
        do k=2,npm-1
          aop(  1,k,i) =  fn*bc_flow_npole_fac
          aop(ntm,k,i) = -fs*bc_flow_spole_fac
        enddo
      enddo

-Ron

sumseq avatar Dec 12 '25 00:12 sumseq

Update:

I installed the latest driver I could find for my RADEON RX6800: amdgpu-install-7.1.1.70101-1.el9.noarch.rpm and the newest drop of AFAR: rocm-afar-10004-drop-22.3.0 which I installed into /opt/afar.

I am loading the environment with:

export PATH=/opt/afar/bin:$PATH
export LD_LIBRARY_PATH=/opt/afar/lib:$LD_LIBRARY_PATH

export CC=amdclang
export CXX=amdclang++
export FC=amdflang
export F77=amdflang
export F90=amdflang
export OMPI_FC=amdflang

I am using the WACCPD branch of HipFT with the following change to replace the use of MODULO with hand-written code:

-          p1 = MODULO(pi_two+vpt,      twopi)
-          p2 = MODULO(threepi_two+vpt, twopi)
+          p1 = pi_two+vpt       
+          p2 = threepi_two+vpt
+          p1 = p1 - twopi * FLOOR(p1 / twopi)
+          if (p1 < 0.0) then
+            p1 = p1 + twopi
+          end if
+          p2 = p2 - twopi * FLOOR(p2 / twopi)
+          if (p2 < 0.0) then
+            p2 = p2 + twopi
+          end if

I am using the MPI compiler wrapper line:

amdflang -I/opt/psi/afar/ext_deps/deps/openmpi-5.0.6/include 
-I/opt/psi/afar/ext_deps/deps/openmpi-5.0.6/lib 
-L/opt/psi/afar/ext_deps/deps/openmpi-5.0.6/lib 
-Wl,-rpath -Wl,/opt/psi/afar/ext_deps/deps/openmpi-5.0.6/lib 
-Wl,--enable-new-dtags -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi

where I have compiled the OpenMPI 5.0.6 with AFAR.

I am using the compiler options:

-O3 -fopenmp -fdo-concurrent-to-openmp=device --offload-arch=gfx1030

The code compiles fine. However, when I run the testsuite, the code runs extremely slowly. amd-smi shows only 1% GPU utilization and only 42W out of a ~200W TDP, but it is using the correct amount of VRAM:

+---------------------------------------------------------------------------------v2025.04.29---+
| Driver: 6.16.6   ROCM-SMI: 4.0.0+54fe1cb   ROCM-SMI-LIB: 7.8.0                                |
| AMDSMI Tool: 26.2.0+021c61fc-dirty   AMDSMI Library: 26.2.0   ROCm: 7.1.1                     |
| amdgpu: 6.16.6   amd_hsmp: N/A                                                                |
|----------------------------------------+----------------------------------+-------------------|
| GPU               Name                 |                                  |      GPU-Util     |
|  Fan   Temp    Perf    Power-Usage     |      Memory-Usage                |   SCLK     MCLK   |
|========================================+==================================+===================|
| 0   AMD Radeon RX 6800 (gfx1030)       | VRAM:    323MiB/ 16368MiB (  0%) |        1%         |
|  14%  45.0°C  auto   42.0W/203.0W      | GTT :     14MiB/ 64359MiB (  0%) |  2305Mhz    96Mhz |
|----------------------------------------+----------------------------------+-------------------|

The good news (and update) is that the testsuite now passes the tests (gets the right answer)!

The long run times lead me to believe that either things are being run serially on the GPU, or it's a bad mapping for the OpenMP target regions derived from do concurrent. The default OpenMP target mapping should have "present-or-copy" behavior which, combined with my unstructured data regions, should avoid data transfers.
The Intel compiler had a similar problem (it was defaulting the mapping to "map:always"), so a new compiler flag was added to specify that a default of "present" should be used.
Could something like that be happening here?
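
To illustrate what I am relying on (a hypothetical sketch, not HipFT code): with an unstructured data region, each kernel launch's implicit map should find the array already present on the device and skip the transfer.

program present_or_copy_sketch
  implicit none
  integer, parameter :: n = 1024
  real(8) :: f(n)
  integer :: i, step
  f = 0.0d0
  ! Unstructured data region: f stays resident on the device across steps.
!$omp target enter data map(to: f)
  do step = 1, 100
    ! With "present-or-copy" behavior, no host<->device transfer of f
    ! should happen here; with an "always"-style default mapping, f would
    ! be copied back and forth on every kernel launch.
    do concurrent (i=1:n)
      f(i) = f(i) + 1.0d0
    end do
  end do
!$omp target exit data map(from: f)
  print *, f(1)
end program present_or_copy_sketch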

Is there a simple profiling command to see how much data transfer is happening versus compute?

Please let me know if you need more information or want me to try something else.

-- Ron

sumseq avatar Jan 06 '26 02:01 sumseq

Thanks a lot for following up on this. I will look into your results above and get back to you by next week at the latest.

ergawy avatar Jan 06 '26 06:01 ergawy