[Issue]: Building hipFFT on NVIDIA platform. [Perlmutter supercomputer]

Open rgayatri23 opened this issue 1 year ago • 10 comments

Problem Description

I am trying to build hipfft/rocm-5.5.1 for the NVIDIA A100 GPUs on the Perlmutter supercomputer. I already have cuda/12.2 and the corresponding cuFFT in my path. There is also a hipcc/5.5.1 configured against that CUDA version. Here is the CMake command:

cmake -DCMAKE_CXX_COMPILER=g++ -DCMAKE_BUILD_TYPE=Release -DBUILD_WITH_LIB=CUDA -DCMAKE_INSTALL_PREFIX=$PWD/../install -L ../

The error

-- Found ROCm
CMake Error at /global/u1/r/rgayatri/.local/cmake/share/cmake-3.23/Modules/CMakeFindDependencyMacro.cmake:47 (find_package):
  By not providing "Findamd_comgr.cmake" in CMAKE_MODULE_PATH this project
  has asked CMake to find a package configuration file provided by
  "amd_comgr", but CMake did not find one.

  Could not find a package configuration file provided by "amd_comgr" with
  any of the following names:

    amd_comgrConfig.cmake
    amd_comgr-config.cmake

  Add the installation prefix of "amd_comgr" to CMAKE_PREFIX_PATH or set
  "amd_comgr_DIR" to a directory containing one of the above files.  If
  "amd_comgr" provides a separate development package or SDK, be sure it has
  been installed.
Call Stack (most recent call first):
  /global/common/software/nersc/pe/rocm/5.5.1/lib64/cmake/hip/hip-config.cmake:183 (find_dependency)
  library/CMakeLists.txt:34 (find_package)
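
The error above names the two config files CMake looks for. A minimal sketch for checking whether either exists under a given prefix follows; the helper name `find_cmake_config` is hypothetical, and the prefix in the usage comment is the one from the call stack. If nothing is printed, the package is probably not installed under that prefix, and pointing `amd_comgr_DIR` or `CMAKE_PREFIX_PATH` at that prefix is unlikely to help.

```shell
# Hypothetical helper: list CMake package config files for <package>
# anywhere under <prefix>. The two file-name patterns are the ones
# printed in the CMake error message.
# Usage: find_cmake_config <prefix> <package>
find_cmake_config() {
    prefix="$1"
    pkg="$2"
    find "$prefix" -type f \
        \( -name "${pkg}Config.cmake" -o -name "${pkg}-config.cmake" \) \
        2>/dev/null
}

# Example (prefix taken from the call stack in the error):
# find_cmake_config /global/common/software/nersc/pe/rocm/5.5.1 amd_comgr
```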

Operating System

SLES 15-SP4

CPU

AMD EPYC 7713 64-Core Processor

GPU

AMD Instinct MI250X

ROCm Version

ROCm 5.5.1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

FYI - I did not find the right GPU option in the selection, so I selected randomly in order to be able to submit the issue.

rgayatri23 avatar Mar 12 '24 19:03 rgayatri23

If you set HIP_PLATFORM=nvidia in the environment, does that make a difference?

evetsso avatar Mar 12 '24 22:03 evetsso

It's already set and it did not make any difference. It's usually set in our environment whenever hip-rocm modules are loaded.

rgayatri23 avatar Mar 13 '24 00:03 rgayatri23

Hmm, can you try commenting out the find_package(HIP REQUIRED) on library/CMakeLists.txt:34? Now that I look, it doesn't seem like it should be necessary.

evetsso avatar Mar 13 '24 23:03 evetsso

@rgayatri23 Please try this:

module purge
module load cuda hip-cuda boost cmake fftw
export HIP_PLATFORM=nvidia
cmake -DROCM_DIR=<PATH_TO_HIPCUDA> -DCMAKE_MODULE_PATH=<PATH_TO_HIPCUDA>/hip/cmake/ -DCMAKE_CXX_COMPILER=hipcc -DHIP_ROOT_DIR=<PATH_TO_HIPCUDA> -DBUILD_WITH_LIB=CUDA -DBUILD_CLIENTS=ON -DCMAKE_CXX_FLAGS="-gencode=arch=compute_80,code=sm_80" ..

af-ayala avatar Mar 14 '24 01:03 af-ayala

Thanks @af-ayala. This time the build got a bit further but was blocked on a different issue, so partial success! CMake is unable to find FFTW, even though it's definitely in the path:

-- Could NOT find GTest (missing: GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY) (Required is at least version "1.11.0")
CMake Error at /global/u1/r/rgayatri/.local/cmake/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find FFTW (missing: FFTW_INCLUDE_DIRS FFTW_LIBRARIES) (Required
  is at least version "3.0")
Call Stack (most recent call first):
  /global/u1/r/rgayatri/.local/cmake/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  clients/cmake/FindFFTW.cmake:103 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  clients/tests/CMakeLists.txt:26 (find_package)


-- Configuring incomplete, errors occurred!
See also "/pscratch/sd/r/rgayatri/HIP-LZ/hipFFT/build/CMakeFiles/CMakeOutput.log".
See also "/pscratch/sd/r/rgayatri/HIP-LZ/hipFFT/build/CMakeFiles/CMakeError.log".
rgayatri@perlmutter:login40:/pscratch/sd/r/rgayatri/HIP-LZ/hipFFT/build> echo $CPATH
/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/math_libs/12.2/include:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/12.2/include
rgayatri@perlmutter:login40:/pscratch/sd/r/rgayatri/HIP-LZ/hipFFT/build> ls /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/math_libs/12.2/include/*cufft*
.rw-r--r-- 12k root 29 Sep  2023 /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/math_libs/12.2/include/cufft.h
.rw-r--r-- 19k root 29 Sep  2023 /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/math_libs/12.2/include/cufftw.h
.rw-r--r-- 12k root 29 Sep  2023 /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/math_libs/12.2/include/cufftXt.h

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/math_libs/12.2/include/cufftmp:
.rw-r--r-- 4.1k root 29 Sep  2023 cudalibxt.h
.rw-r--r--  12k root 29 Sep  2023 cufft.h
.rw-r--r-- 5.1k root 29 Sep  2023 cufftMp.h
.rw-r--r--  19k root 29 Sep  2023 cufftw.h
.rw-r--r--  12k root 29 Sep  2023 cufftXt.h
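
Note that the listing above shows cuFFT headers (`cufft.h`, `cufftw.h`), while the failing check is for FFTW (`FFTW_INCLUDE_DIRS`, `FFTW_LIBRARIES`). A minimal sketch for checking whether an FFTW file is reachable through a colon-separated path variable follows. The helper name `find_in_path_list` is hypothetical, and it is an assumption that `clients/cmake/FindFFTW.cmake` looks for the standard FFTW header `fftw3.h`; where it actually searches is not shown in this thread.

```shell
# Hypothetical helper: search a colon-separated directory list (such as
# $CPATH or $LD_LIBRARY_PATH) for a file name. Prints every directory
# that contains the file and returns non-zero if none does.
# Usage: find_in_path_list <file> <colon-separated-dirs>
find_in_path_list() {
    file="$1"
    list="$2"
    found=1
    old_ifs=$IFS
    IFS=:
    for dir in $list; do
        if [ -n "$dir" ] && [ -e "$dir/$file" ]; then
            printf '%s\n' "$dir"
            found=0
        fi
    done
    IFS=$old_ifs
    return $found
}

# Examples:
# find_in_path_list fftw3.h "$CPATH"                  # FFTW header
# find_in_path_list libfftw3.so "$LD_LIBRARY_PATH"    # double precision
# find_in_path_list libfftw3f.so "$LD_LIBRARY_PATH"   # single precision
```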

rgayatri23 avatar Mar 14 '24 20:03 rgayatri23

Did you build FFTW yourself, or are you using the SLES packages? The distro packages are easier to use since they include both single and double precision libraries.

evetsso avatar Mar 14 '24 21:03 evetsso

The GPU software is all built through the distro packages.

rgayatri23 avatar Mar 15 '24 00:03 rgayatri23

If you just want to build the library, setting -DBUILD_CLIENTS=OFF will get you that. Using modules on supercomputers can be tricky. To build our testing infrastructure with -DBUILD_CLIENTS=ON, you do need the dependencies for which you're getting errors. I would suggest the following procedure, which works for me on other clusters:

  • Get the modules you need; spider will tell you what you need to load first, e.g. 'ums/default':

    module spider fftw
    module load ums/default
    module purge
    module load boost googletest
    module load fftw/3.3.10

af-ayala avatar Mar 15 '24 04:03 af-ayala

Even with BUILD_CLIENTS=OFF, CMake is still looking for cuFFT. Is there a CMake variable for passing the path? I tried everything from adding the path to CMAKE_PREFIX_PATH to passing it as CXX and linker flags, but it looks like the path is not being picked up.
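
For reference, CMake's generic search-path variables for `find_path` and `find_library` are `CMAKE_INCLUDE_PATH` and `CMAKE_LIBRARY_PATH`. The sketch below only assembles those two flags for a given install root. Whether hipFFT's cuFFT lookup honors them is not confirmed anywhere in this thread; the helper name `cufft_cmake_flags` is hypothetical, and the `lib64` subdirectory is an assumption about the install layout.

```shell
# Hypothetical helper: print generic CMake search-path flags for a
# cuFFT install root.
# Assumption: headers are in <root>/include, libraries in <root>/lib64.
cufft_cmake_flags() {
    root="$1"
    printf '%s %s\n' \
        "-DCMAKE_INCLUDE_PATH=$root/include" \
        "-DCMAKE_LIBRARY_PATH=$root/lib64"
}

# Example, using the math_libs directory shown earlier in this thread:
# cmake $(cufft_cmake_flags /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/math_libs/12.2) \
#       -DBUILD_WITH_LIB=CUDA -DBUILD_CLIENTS=OFF ..
```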

rgayatri23 avatar Mar 15 '24 19:03 rgayatri23

@rgayatri23 Can you please check if you are still seeing the issue with the latest ROCm 6.1.2? Thanks!

ppanchad-amd avatar Jul 08 '24 20:07 ppanchad-amd

This has been stale for a while; closing for now. Feel free to re-open if there's still a problem!

malcolmroberts avatar Sep 03 '24 15:09 malcolmroberts

Sure. Sorry about the delay. I am having issues building rocm/6.0 on the NVIDIA platform. I will test this again once that is done.

rgayatri23 avatar Sep 03 '24 22:09 rgayatri23