Spock: pthread issue

Open ax3l opened this issue 4 years ago • 15 comments

from @WeiqunZhang via <unknown user> report on Spock (OLCF).

On a login node:

module load cmake/3.21.2-dev rocm/4.3.0
git clone git@github.com:AMReX-Codes/amrex-tutorials.git
cd amrex-tutorials
cmake -S . \
-B build/3d.gnu.float.hip \
-DAMReX_FORTRAN=OFF \
-DAMReX_GPU_BACKEND=HIP \
-DAMReX_AMD_ARCH=gfx908 \
-DAMReX_OMP=OFF \
-DAMReX_MPI=OFF \
-DAMReX_LINEAR_SOLVERS=OFF \
-DAMReX_PRECISION=SINGLE \
-DAMReX_SPACEDIM=3 \
-DCMAKE_CXX_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang++ \
-DCMAKE_CXX_STANDARD=17 \
-DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
-DAMReX_TINY_PROFILE=OFF -DAMReX_BASE_PROFILE=OFF \
-DAMReX_AMRLEVEL=OFF \
-DCMAKE_BUILD_TYPE=Release
cmake --build build/3d.gnu.float.hip -j 12

results in

[ 72%] Linking CXX executable Amr_Advection_AmrCore
cd /ccs/home/wqzhang/mygitrepo/amrex-tutorials/build/3d.gnu.float.hip/Amr/Advection_AmrCore && /autofs/nccs-svm1_sw/spock/spack-envs/base/opt/linux-sles15-x86_64/gcc-7.5.0/cmake-3.21.2-dev-ovcgpray6yyjz2n7wjuv6lv4qkgietzs/bin/cmake -E cmake_link_script CMakeFiles/Amr_Advection_AmrCore.dir/link.txt --verbose=1
/opt/rocm-4.3.0/llvm/bin/clang++ -O3 -DNDEBUG -fgpu-rdc CMakeFiles/Amr_Advection_AmrCore.dir/Source/AdvancePhiAllLevels.cpp.o CMakeFiles/Amr_Advection_AmrCore.dir/Source/AdvancePhiAtLevel.cpp.o CMakeFiles/Amr_Advection_AmrCore.dir/Source/AmrCoreAdv.cpp.o CMakeFiles/Amr_Advection_AmrCore.dir/Source/DefineVelocity.cpp.o CMakeFiles/Amr_Advection_AmrCore.dir/Source/main.cpp.o -o Amr_Advection_AmrCore  -Wl,-rpath,/opt/rocm-4.3.0/hip/lib:/opt/rocm-4.3.0/lib:/opt/rocm-4.3.0/hiprand/lib:/opt/rocm-4.3.0/rocrand/lib ../../_deps/amrex-build/Src/libamrex.a /opt/rocm-4.3.0/hip/lib/libamdhip64.so.4.3.40300 --hip-link --offload-arch=gfx908 -L"/opt/rocm-4.3.0/llvm/lib/clang/13.0.0/include/../lib/linux" -lclang_rt.builtins-x86_64 /opt/rocm-4.3.0/hiprand/lib/libhiprand.so.1.1.40300 /opt/rocm-4.3.0/rocrand/lib/librocrand.so.1.1.40300 -Wl,-rpath-link,/opt/rocm-4.3.0/lib
ld.lld: error: undefined symbol: pthread_create
>>> referenced by AMReX_BackgroundThread.cpp
>>>               AMReX_BackgroundThread.cpp.o:(amrex::BackgroundThread::BackgroundThread()) in archive ../../_deps/amrex-build/Src/libamrex.a

ax3l avatar Sep 03 '21 21:09 ax3l

Most likely issue: https://github.com/ROCmSoftwarePlatform/rocRAND/pull/29#issuecomment-912815457

$ ldd /opt/rocm-4.3.0/rocrand/lib/librocrand.so.1.1.40300
...
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0650e4e000)
...

ax3l avatar Sep 03 '21 21:09 ax3l

Actually, it looks like we are the ones missing it, since I cannot find any other unresolved libpthread dependency in rocrand.

ld.lld: error: undefined symbol: pthread_create
>>> referenced by AMReX_BackgroundThread.cpp
>>>               AMReX_BackgroundThread.cpp.o:(amrex::BackgroundThread::BackgroundThread()) in archive ../../_deps/amrex-build/Src/libamrex.a

Interesting, since we search and link pthreads: https://github.com/AMReX-Codes/amrex/blob/168a690497396de4c6b89a36b6edb0430e51ef4c/Tools/CMake/AMReXParallelBackends.cmake#L1-L8

ax3l avatar Sep 03 '21 22:09 ax3l

The CMake output from this setup:

-- The C compiler identification is Clang 12.0.0
-- The CXX compiler identification is Clang 13.0.0
...
-- Check for working C compiler: /opt/cray/pe/craype/2.7.8/bin/cc - skipped
...
-- Check for working CXX compiler: /opt/rocm-4.3.0/llvm/bin/clang++ - skipped

is concerning. It looks like the Cray and the AMD Clang compilers are mixed.

One should add

-DCMAKE_C_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang

too for consistency.
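
A minimal sketch, not part of the original report, of how one might check an existing build directory for this kind of mix-up. It assumes the compilers are recorded in CMakeCache.txt as CMAKE_C_COMPILER and CMAKE_CXX_COMPILER entries of the form NAME:TYPE=VALUE; the function name check_compiler_mix is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether the C and C++ compilers recorded in a
# CMake build directory live in the same directory.
check_compiler_mix() {
    cache="$1/CMakeCache.txt"
    # Cache entries have the form NAME:TYPE=VALUE; strip everything up to "=".
    cc=$(sed -n 's/^CMAKE_C_COMPILER:[A-Z]*=//p' "$cache")
    cxx=$(sed -n 's/^CMAKE_CXX_COMPILER:[A-Z]*=//p' "$cache")
    if [ "$(dirname "$cc")" = "$(dirname "$cxx")" ]; then
        echo "consistent: $cc / $cxx"
    else
        echo "mixed: $cc / $cxx"
        return 1
    fi
}
```

It could be called as check_compiler_mix build/3d.gnu.float.hip. Comparing directories is only a rough heuristic: two compilers in the same directory are usually from the same toolchain, but that is an assumption, not a guarantee.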

ax3l avatar Sep 03 '21 22:09 ax3l

Yes, we should do that. That seems to fix the pthread issue.

WeiqunZhang avatar Sep 03 '21 22:09 WeiqunZhang

Let's ignore the errors in compiling tutorials that use AmrLevel. If I run amrex-tutorials/build/3d.gnu.float.hip/Basic/HelloWorld_C/Basic_HelloWorld_C, I get

Initializing HIP...
HIP initialized.
"Cannot find Symbol"
SIGABRT
See Backtrace.0 file for details

So now we have reproduced the symbol issue reported to us.

WeiqunZhang avatar Sep 03 '21 22:09 WeiqunZhang

Compiling now with

cmake -S . -B build/3d.gnu.float.hip -DAMReX_FORTRAN=OFF -DAMReX_GPU_BACKEND=HIP -DAMReX_AMD_ARCH=gfx908 -DAMReX_OMP=OFF -DAMReX_MPI=OFF -DAMReX_PRECISION=SINGLE -DAMReX_SPACEDIM=3 -DCMAKE_CXX_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang++ -DCMAKE_CXX_STANDARD=17 -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang
cmake --build build/3d.gnu.float.hip -j 12

to reproduce

ax3l avatar Sep 03 '21 22:09 ax3l

With cmake 3.20.2 we can use hipcc as CXX Compiler:

cmake -S . -B build/3d.gnu.float.hip -DAMReX_FORTRAN=OFF -DAMReX_GPU_BACKEND=HIP -DAMReX_AMD_ARCH=gfx908 -DAMReX_OMP=OFF -DAMReX_MPI=OFF -DAMReX_PRECISION=SINGLE -DAMReX_SPACEDIM=3 -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_CXX_STANDARD=17 -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang

So only some LLVM magic flags from hipcc are missing.

ax3l avatar Sep 03 '21 23:09 ax3l

Doing the same thing with cmake/3.21.2-dev unravels hipcc to clang++.

Now we have to work around the bug, already fixed upstream, about defaults in the -x cxx and -x hip front-ends (ref):

export CXXFLAGS="-std=c++17"
cmake -S . -B build/3d.gnu.float.hip -DAMReX_FORTRAN=OFF -DAMReX_GPU_BACKEND=HIP -DAMReX_AMD_ARCH=gfx908 -DAMReX_OMP=OFF -DAMReX_MPI=OFF -DAMReX_PRECISION=SINGLE -DAMReX_SPACEDIM=3 -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_CXX_STANDARD=17 -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang

That still raises "Cannot find Symbol", though, so some LLVM flags are still being lost somewhere, maybe because ROCm 4.3.0 does not yet anticipate CMake 3.21-dev and the hip::device target therefore misses some flags.

For now, users should not use a dev version of CMake on Spock, only the latest stable release.

ax3l avatar Sep 03 '21 23:09 ax3l

For the "Cannot find Symbol" issue, one can strace the application like this (Crusher example):

export proj=aphXYZ  # change this to your OLCF project
alias runNode="srun -A $proj -J warpx -t 00:30:00 -p batch -N 1 -c 8 --ntasks-per-node=8"

cd build/bin
runNode strace ./warpx ../../Examples/Physics_applications/laser_acceleration/inputs_3d 2>&1 | grep -E '^open(at)?\(.*\.so'

Note the latest Crusher instructions in WarpX: https://warpx.readthedocs.io/en/latest/install/hpc/crusher.html
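
As a hedged sketch, not from the original comment, the grep filter above can be narrowed to shared libraries that failed to open. It assumes strace's usual one-call-per-line output, where a failed call ends in "= -1 ENOENT ..."; the function name failed_so_opens is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: read strace output on stdin and print the paths of
# shared libraries whose open()/openat() call failed with ENOENT.
failed_so_opens() {
    grep -E '^open(at)?\(.*\.so' \
        | grep -E '= -1 ENOENT' \
        | sed -E 's/^[^"]*"([^"]*)".*/\1/'
}
```

It would replace the final grep in the pipeline, for example runNode strace ./warpx ... 2>&1 | failed_so_opens. Many ENOENT results are expected, because the dynamic loader probes several directories for each library, so a path listed here is only suspicious if it never shows up as a successful open elsewhere in the trace.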

ax3l avatar Jan 20 '22 19:01 ax3l

I am getting the "Cannot find Symbol" issue on NCSA Delta's MI100 node. Unfortunately, it doesn't have the Cray compilers installed, so I can't follow the WarpX build instructions. Is there another workaround?

BenWibking avatar Sep 12 '22 15:09 BenWibking

gnu make

WeiqunZhang avatar Sep 12 '22 15:09 WeiqunZhang

gnu make

Weirdly, although it complains, it also works if I set CMAKE_CXX_COMPILER to hipcc. Is this a CMake bug?

BenWibking avatar Sep 12 '22 16:09 BenWibking

I don't know. GNU make uses the hipcc wrapper instead of AMD's clang.

WeiqunZhang avatar Sep 12 '22 16:09 WeiqunZhang

FYI: the hipcc/amdclang++ issue has been passed along to AMD's ROCm dev team.

BenWibking avatar Sep 23 '22 18:09 BenWibking

I ran into this as well (ORNL Crusher this time). I used Cray's CC wrapper and CMake. Is there a solution other than using hipcc or GNU make? I am building for Cactus/CarpetX, which is itself a complex build system, so, given that it took me a couple of days to get things working with CC, I am hoping not to have to redo everything for hipcc ;-)

rhaas80 avatar Mar 23 '23 18:03 rhaas80