hipBLASLt

[Issue]: Build uses ~100 cpu-hours

G-Ragghianti opened this issue 10 months ago • 5 comments

Problem Description

I noticed that this project is the longest build in the ROCm stack that we are using. We often build the stack from source via spack due to complications with using the binary distributions. The build currently takes around 6 hours of wall time, but it is actually worse than that: it uses about 100 CPU-hours. This appears to be due to a built-in build job distribution system that launches a process for each CPU core on the system. These processes sit in a spin-wait state while the distribution of jobs is very slow and does not keep all the workers busy, resulting in an extreme waste of CPU cycles on systems with many cores. I have a Dockerfile which I used to reproduce this, along with the cmake and make output:

Dockerfile

FROM rockylinux:9
RUN dnf -y group install development
COPY rocm.repo /etc/yum.repos.d/
RUN dnf -y install epel-release
RUN dnf -y --enablerepo=crb install perl-File-BaseDir perl-URI-Encode
RUN dnf -y install hipblaslt
RUN git clone https://github.com/ROCm/hipBLASLt /tmp/hipblaslt
WORKDIR /tmp/hipblaslt
RUN dnf -y install cmake
RUN dnf -y install rocm-hip-sdk rocprim rocm-ml-sdk rocm-openmp-sdk rocm-developer-tools
RUN dnf -y install msgpack-devel time
RUN mkdir build && \
    cd build && \
    cmake .. 2>&1 | tee cmake.log
RUN cd build && \
    /usr/bin/time make 2>&1 | tee make.log

Cmake:

-- The CXX compiler identification is Clang 18.0.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rocm/bin/amdclang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.5") 
-- Setting build type to 'Release' as none was specified.
-- Using amdclang to build for amdgpu backend

*******************************************************************************
*------------------------------- ROCMChecks WARNING --------------------------*
  Options and properties should be set on a cmake target where possible. The
  variable 'CMAKE_CXX_FLAGS' may be set by the cmake toolchain, either by
  calling 'cmake -DCMAKE_CXX_FLAGS=" -D__HIP_HCC_COMPAT_MODE__=1"'
  or set in a toolchain file and added with
  'cmake -DCMAKE_TOOLCHAIN_FILE=<toolchain-file>'. ROCMChecks now calling:
CMake Warning at /opt/rocm/share/rocmcmakebuildtools/cmake/ROCMChecks.cmake:46 (message):
  'CMAKE_CXX_FLAGS' is set at /tmp/hipblaslt/CMakeLists.txt:<line#> shown
  below:
Call Stack (most recent call first):
  CMakeLists.txt:9223372036854775807 (rocm_check_toolchain_var)
  CMakeLists.txt:139 (set)


*-----------------------------------------------------------------------------*
*******************************************************************************


*******************************************************************************
*------------------------------- ROCMChecks WARNING --------------------------*
  Options and properties should be set on a cmake target where possible. The
  variable 'CMAKE_CXX_FLAGS' may be set by the cmake toolchain, either by
  calling 'cmake -DCMAKE_CXX_FLAGS=" -D__HIP_HCC_COMPAT_MODE__=1 -O3"'
  or set in a toolchain file and added with
  'cmake -DCMAKE_TOOLCHAIN_FILE=<toolchain-file>'. ROCMChecks now calling:
CMake Warning at /opt/rocm/share/rocmcmakebuildtools/cmake/ROCMChecks.cmake:46 (message):
  'CMAKE_CXX_FLAGS' is set at /tmp/hipblaslt/CMakeLists.txt:<line#> shown
  below:
Call Stack (most recent call first):
  CMakeLists.txt:9223372036854775807 (rocm_check_toolchain_var)
  CMakeLists.txt:144 (set)


*-----------------------------------------------------------------------------*
*******************************************************************************

-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_on
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_on - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_off - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_on
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_on - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_off - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx942
-- Performing Test COMPILER_HAS_TARGET_ID_gfx942 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1100
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1100 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1101
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1101 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1200
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1200 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1201
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1201 - Success
-- AMDGPU_TARGETS: gfx908:xnack+;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx942;gfx1100;gfx1101;gfx1200;gfx1201
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Success
-- Python_ROOT is unset. Setting Python_ROOT to /usr.
-- Configure Python_ROOT variable if a different installation is preferred.
-- Found Python: /usr/bin/python3.9 (found version "3.9.18") found components: Interpreter 
'/usr/bin/python3.9' '-m' 'venv' '/tmp/hipblaslt/build/virtualenv' '--system-site-packages' '--clear'
'/tmp/hipblaslt/build/virtualenv/bin/python3.9' '-m' 'pip' 'install' '--upgrade' 'pip'
Requirement already satisfied: pip in ./virtualenv/lib/python3.9/site-packages (21.2.3)
Collecting pip
  Downloading pip-25.0-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.2.3
    Uninstalling pip-21.2.3:
      Successfully uninstalled pip-21.2.3
Successfully installed pip-25.0
'/tmp/hipblaslt/build/virtualenv/bin/python3.9' '-m' 'pip' 'install' '--upgrade' 'setuptools'
Requirement already satisfied: setuptools in ./virtualenv/lib/python3.9/site-packages (53.0.0)
Collecting setuptools
  Downloading setuptools-75.8.0-py3-none-any.whl.metadata (6.7 kB)
Downloading setuptools-75.8.0-py3-none-any.whl (1.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 11.2 MB/s eta 0:00:00
Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 53.0.0
    Uninstalling setuptools-53.0.0:
      Successfully uninstalled setuptools-53.0.0
Successfully installed setuptools-75.8.0
'/tmp/hipblaslt/build/virtualenv/bin/python3.9' '-m' 'pip' 'install' '/tmp/hipblaslt/tensilelite'
-- Adding /tmp/hipblaslt/build/virtualenv to CMAKE_PREFIX_PATH
-- The C compiler identification is Clang 18.0.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/rocm/bin/amdclang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Tensile script: /tmp/hipblaslt/build/virtualenv/lib64/python3.9/site-packages/Tensile/bin/TensileCreateLibrary
-- Tensile_CREATE_COMMAND: /tmp/hipblaslt/build/virtualenv/bin/python3.9;/tmp/hipblaslt/build/virtualenv/lib64/python3.9/site-packages/Tensile/bin/TensileCreateLibrary;--code-object-version=4;--cxx-compiler=amdclang++;--library-format=msgpack;--architecture=gfx908:xnack+_gfx908:xnack-_gfx90a:xnack+_gfx90a:xnack-_gfx942_gfx1100_gfx1101_gfx1200_gfx1201;--build-id=sha1;/tmp/hipblaslt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full;/tmp/hipblaslt/build/Tensile;HIP
Setup source kernel targets
archs for source kernel compilation: gfx908,gfx90a,gfx942,gfx1100,gfx1101,gfx1200,gfx1201
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Configuring done (28.0s)
-- Generating done (0.0s)
-- Build files have been written to: /tmp/hipblaslt/build

Make:

[  2%] Generating Tensile Libraries

################################################################################
# Tensile Create Library
# HIP Version:         6.3.42133-1b9c17779
# Cxx Compiler:        /opt/rocm/bin/amdclang++ (version 18.0.0)
# C Compiler:          /opt/rocm/bin/amdclang (version 18.0.0)
# Assembler:           /opt/rocm/bin/amdclang++ (version 18.0.0)
# Offload Bundler:     /opt/rocm/lib/llvm/bin/clang-offload-bundler (version 18.0.0)
# Code Object Version: 4
...
...
...
[ 83%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp.o
[ 86%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/rocblaslt_mat.cpp.o
[ 89%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/utility.cpp.o
[ 91%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/rocblaslt_transform.cpp.o
[ 94%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/UserDrivenTuningParser.cpp.o
[ 97%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/tensile_host.cpp.o
[100%] Linking CXX shared library libhipblaslt.so
[100%] Built target hipblaslt
330684.60user 17773.57system 6:01:20elapsed 1607%CPU (0avgtext+0avgdata 124557448maxresident)k
17455320inputs+337551070outputs (41790269major+2925761090minor)pagefaults 0swaps

Operating System

Rockylinux 9

CPU

Any

GPU

Other

Other

No response

ROCm Version

ROCm 6.2.3

ROCm Component

hipBLASLt

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

G-Ragghianti avatar Feb 03 '25 08:02 G-Ragghianti

Hi @G-Ragghianti. Internal ticket has been created to investigate this issue. Thanks!

ppanchad-amd avatar Feb 03 '25 14:02 ppanchad-amd

Hi @G-Ragghianti, sorry for the inconvenience this is causing! We're aware of severe build time increases in several ROCm components post-6.2, with hipBLASLt being a particularly notable offender. We're attacking this from several angles and have some improvements in the pipeline already which should cut down the build time and binary size significantly. I don't have any firm timelines on this, but it's a high priority issue for us.

For now, I'd recommend setting AMDGPU_TARGETS to reflect only the architectures you need to build for, which should help cut down the build time and size.
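
For example, configuring with an explicit target list looks like this (a minimal sketch; substitute the architecture(s) of the GPUs you actually have, e.g. gfx90a for MI200-series hardware):

cd build && cmake .. -DAMDGPU_TARGETS="gfx90a:xnack-" 2>&1 | tee cmake.log

This narrows the nine-target default shown in the cmake log above, so both the Tensile library generation and the source kernel compilation cover only the selected architecture.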

schung-amd avatar Feb 03 '25 15:02 schung-amd

Thanks for looking at it. I encourage re-evaluating the use of loky/joblib for the hipBLASLt build. One option that would help spack users is making it easy to disable loky job management via cmake; the spack package could then disable or limit the unnecessary CPU use. I'm also surprised that loky/joblib uses a busy-spin method for the multiprocess communication.
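
For example, if the workers honor loky's LOKY_MAX_CPU_COUNT override, a wrapper could cap the worker count (untested, and whether the hipBLASLt/Tensile build respects it is an assumption on my part):

# Hypothetical stopgap: cap the CPU count that loky/joblib detects before building.
export LOKY_MAX_CPU_COUNT=16
cd build && /usr/bin/time make 2>&1 | tee make.log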

G-Ragghianti avatar Feb 03 '25 16:02 G-Ragghianti

@G-Ragghianti Thanks for raising this issue. I'm on the team working to improve resource consumption during the build, and rest assured, we have identified joblib as a key offender for the reasons you mention. We're actively working on decoupling the parallelization layer from the build steps, after which we may either replace joblib or, at the very least, make improvements to address your ask.

bstefanuk avatar Feb 04 '25 08:02 bstefanuk

Oh wow. This is more than I had hoped for. Thanks a lot!

G-Ragghianti avatar Feb 04 '25 08:02 G-Ragghianti

This appears not to be resolved as of the 6.4.1 release. The Python processes that drive the build are using most of the CPU cycles, leaving few cycles for the actual build processes.
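
One quick way to see the split while the build is running (illustrative commands, not the exact ones from my run):

# Per-process CPU usage, sampled every 10 seconds; pidstat is from the sysstat package.
pidstat -u 10 -l | grep -E 'python|clang'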

G-Ragghianti avatar Jul 18 '25 13:07 G-Ragghianti

I'll check in with the internal team to see what the status is on this, both in general and for the specific joblib issue. We've made improvements to reduce the build time and memory usage of the hipBLASLt build, but at the end of the day I think this is always going to take a lot of resources.

If the joblib issue isn't yet resolved in an upcoming release, I'll reopen and mark this as a feature request.

schung-amd avatar Jul 18 '25 14:07 schung-amd