[Issue]: Build uses ~100 CPU-hours
Problem Description
I noticed that this project is the longest build in the ROCm stack that we are using. We often build the stack from source via Spack due to complications with using the binary distributions. The build currently takes around 6 hours of wall time to finish, but it is worse than that: it actually consumes about 100 CPU-hours. This appears to be due to a built-in build-job distribution system that launches one worker process per CPU core on the system. These worker processes sit in a spin-wait state while job distribution is slow and fails to keep all the workers busy, which results in an extreme waste of CPU cycles on systems with many cores. I have a Dockerfile which I used to reproduce this, along with the cmake and make output:
Dockerfile
FROM rockylinux:9
RUN dnf -y group install development
# rocm.repo (the ROCm package repository definition) must be present in the build context
COPY rocm.repo /etc/yum.repos.d/
RUN dnf -y install epel-release
RUN dnf -y --enablerepo=crb install perl-File-BaseDir perl-URI-Encode
RUN dnf -y install hipblaslt
RUN git clone https://github.com/ROCm/hipBLASLt /tmp/hipblaslt
WORKDIR /tmp/hipblaslt
RUN dnf -y install cmake
RUN dnf -y install rocm-hip-sdk rocprim rocm-ml-sdk rocm-openmp-sdk rocm-developer-tools
RUN dnf -y install msgpack-devel time
RUN mkdir build && \
    cd build && \
    cmake .. 2>&1 | tee cmake.log
# /usr/bin/time reports total user+system CPU time, which is where the ~100 CPU-hours figure comes from
RUN cd build && \
    /usr/bin/time make 2>&1 | tee make.log
CMake:
-- The CXX compiler identification is Clang 18.0.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rocm/bin/amdclang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.5")
-- Setting build type to 'Release' as none was specified.
-- Using amdclang to build for amdgpu backend
*******************************************************************************
*------------------------------- ROCMChecks WARNING --------------------------*
Options and properties should be set on a cmake target where possible. The
variable 'CMAKE_CXX_FLAGS' may be set by the cmake toolchain, either by
calling 'cmake -DCMAKE_CXX_FLAGS=" -D__HIP_HCC_COMPAT_MODE__=1"'
or set in a toolchain file and added with
'cmake -DCMAKE_TOOLCHAIN_FILE=<toolchain-file>'. ROCMChecks now calling:
CMake Warning at /opt/rocm/share/rocmcmakebuildtools/cmake/ROCMChecks.cmake:46 (message):
'CMAKE_CXX_FLAGS' is set at /tmp/hipblaslt/CMakeLists.txt:<line#> shown
below:
Call Stack (most recent call first):
CMakeLists.txt:9223372036854775807 (rocm_check_toolchain_var)
CMakeLists.txt:139 (set)
*-----------------------------------------------------------------------------*
*******************************************************************************
*******************************************************************************
*------------------------------- ROCMChecks WARNING --------------------------*
Options and properties should be set on a cmake target where possible. The
variable 'CMAKE_CXX_FLAGS' may be set by the cmake toolchain, either by
calling 'cmake -DCMAKE_CXX_FLAGS=" -D__HIP_HCC_COMPAT_MODE__=1 -O3"'
or set in a toolchain file and added with
'cmake -DCMAKE_TOOLCHAIN_FILE=<toolchain-file>'. ROCMChecks now calling:
CMake Warning at /opt/rocm/share/rocmcmakebuildtools/cmake/ROCMChecks.cmake:46 (message):
'CMAKE_CXX_FLAGS' is set at /tmp/hipblaslt/CMakeLists.txt:<line#> shown
below:
Call Stack (most recent call first):
CMakeLists.txt:9223372036854775807 (rocm_check_toolchain_var)
CMakeLists.txt:144 (set)
*-----------------------------------------------------------------------------*
*******************************************************************************
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_on
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_on - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_off - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_on
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_on - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_off - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx942
-- Performing Test COMPILER_HAS_TARGET_ID_gfx942 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1100
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1100 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1101
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1101 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1200
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1200 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1201
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1201 - Success
-- AMDGPU_TARGETS: gfx908:xnack+;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx942;gfx1100;gfx1101;gfx1200;gfx1201
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Success
-- Python_ROOT is unset. Setting Python_ROOT to /usr.
-- Configure Python_ROOT variable if a different installation is preferred.
-- Found Python: /usr/bin/python3.9 (found version "3.9.18") found components: Interpreter
'/usr/bin/python3.9' '-m' 'venv' '/tmp/hipblaslt/build/virtualenv' '--system-site-packages' '--clear'
'/tmp/hipblaslt/build/virtualenv/bin/python3.9' '-m' 'pip' 'install' '--upgrade' 'pip'
Requirement already satisfied: pip in ./virtualenv/lib/python3.9/site-packages (21.2.3)
Collecting pip
Downloading pip-25.0-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 21.2.3
Uninstalling pip-21.2.3:
Successfully uninstalled pip-21.2.3
Successfully installed pip-25.0
'/tmp/hipblaslt/build/virtualenv/bin/python3.9' '-m' 'pip' 'install' '--upgrade' 'setuptools'
Requirement already satisfied: setuptools in ./virtualenv/lib/python3.9/site-packages (53.0.0)
Collecting setuptools
Downloading setuptools-75.8.0-py3-none-any.whl.metadata (6.7 kB)
Downloading setuptools-75.8.0-py3-none-any.whl (1.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 11.2 MB/s eta 0:00:00
Installing collected packages: setuptools
Attempting uninstall: setuptools
Found existing installation: setuptools 53.0.0
Uninstalling setuptools-53.0.0:
Successfully uninstalled setuptools-53.0.0
Successfully installed setuptools-75.8.0
'/tmp/hipblaslt/build/virtualenv/bin/python3.9' '-m' 'pip' 'install' '/tmp/hipblaslt/tensilelite'
-- Adding /tmp/hipblaslt/build/virtualenv to CMAKE_PREFIX_PATH
-- The C compiler identification is Clang 18.0.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/rocm/bin/amdclang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Tensile script: /tmp/hipblaslt/build/virtualenv/lib64/python3.9/site-packages/Tensile/bin/TensileCreateLibrary
-- Tensile_CREATE_COMMAND: /tmp/hipblaslt/build/virtualenv/bin/python3.9;/tmp/hipblaslt/build/virtualenv/lib64/python3.9/site-packages/Tensile/bin/TensileCreateLibrary;--code-object-version=4;--cxx-compiler=amdclang++;--library-format=msgpack;--architecture=gfx908:xnack+_gfx908:xnack-_gfx90a:xnack+_gfx90a:xnack-_gfx942_gfx1100_gfx1101_gfx1200_gfx1201;--build-id=sha1;/tmp/hipblaslt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full;/tmp/hipblaslt/build/Tensile;HIP
Setup source kernel targets
archs for source kernel compilation: gfx908,gfx90a,gfx942,gfx1100,gfx1101,gfx1200,gfx1201
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Configuring done (28.0s)
-- Generating done (0.0s)
-- Build files have been written to: /tmp/hipblaslt/build
Make:
[ 2%] Generating Tensile Libraries
################################################################################
# Tensile Create Library
# HIP Version: 6.3.42133-1b9c17779
# Cxx Compiler: /opt/rocm/bin/amdclang++ (version 18.0.0)
# C Compiler: /opt/rocm/bin/amdclang (version 18.0.0)
# Assembler: /opt/rocm/bin/amdclang++ (version 18.0.0)
# Offload Bundler: /opt/rocm/lib/llvm/bin/clang-offload-bundler (version 18.0.0)
# Code Object Version: 4
...
...
...
[ 83%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp.o
[ 86%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/rocblaslt_mat.cpp.o
[ 89%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/utility.cpp.o
[ 91%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/rocblaslt_transform.cpp.o
[ 94%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/UserDrivenTuningParser.cpp.o
[ 97%] Building CXX object library/CMakeFiles/hipblaslt.dir/src/amd_detail/rocblaslt/src/tensile_host.cpp.o
[100%] Linking CXX shared library libhipblaslt.so
[100%] Built target hipblaslt
330684.60user 17773.57system 6:01:20elapsed 1607%CPU (0avgtext+0avgdata 124557448maxresident)k
17455320inputs+337551070outputs (41790269major+2925761090minor)pagefaults 0swaps
Operating System
Rocky Linux 9
CPU
Any
GPU
Other
Other
No response
ROCm Version
ROCm 6.2.3
ROCm Component
hipBLASLt
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
Hi @G-Ragghianti. An internal ticket has been created to investigate this issue. Thanks!
Hi @G-Ragghianti, sorry for the inconvenience this is causing! We're aware of severe build time increases in several ROCm components post-6.2, with hipBLASLt being a particularly notable offender. We're attacking this from several angles and have some improvements in the pipeline already which should cut down the build time and binary size significantly. I don't have any firm timelines on this, but it's a high priority issue for us.
For now, I'd recommend setting AMDGPU_TARGETS to reflect only the architectures you need to build for, which should help cut down the build time and size.
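For example, a configure line like the following (gfx90a here is purely an example; substitute the architectures you actually need) should limit both the Tensile kernel generation and the source kernel compilation to the listed targets:
cmake -DAMDGPU_TARGETS="gfx90a:xnack+;gfx90a:xnack-" .. 2>&1 | tee cmake.log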
Thanks for looking at it. I encourage a re-evaluation of the use of loky/joblib for the hipBLASLt build. One option that would help Spack users is making it easy to disable loky job management via CMake; the Spack package could then disable or limit the unnecessary CPU use. I'm also surprised that loky/joblib uses a busy-spin method for multiprocess communication.
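As a possible stopgap on our side (untested, and assuming Tensile doesn't pass an explicit worker count down to joblib), loky documents a LOKY_MAX_CPU_COUNT environment variable that caps the core count it detects, which might at least bound the number of spinning workers:
LOKY_MAX_CPU_COUNT=8 /usr/bin/time make 2>&1 | tee make.log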
@G-Ragghianti Thanks for raising this issue. I'm on the team working to improve resource consumption during build, and rest assured, we have certainly identified joblib as a key offender for the reasons you mention. We're actively working on decoupling the parallelization layer from the build steps, after which we may either replace joblib or, at the very least, make improvements to address your ask.
Oh wow. This is more than I had hoped for. Thanks a lot!
This appears not to be resolved as of the 6.4.1 release. The Python processes that drive the build are using most of the CPU cycles, leaving few cycles for the actual build processes.
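For anyone wanting to verify, sampling per-process CPU usage during the build (standard procps, nothing hipBLASLt-specific) shows the Python worker processes at the top rather than the compiler jobs:
ps -eo pcpu,comm --sort=-pcpu | head -20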
I'll check in with the internal team to see what the status is on this, both in general and for the specific joblib issue. We've made improvements to reduce the build time and memory usage of the hipBLASLt build, but at the end of the day I think this is always going to take a lot of resources.
If the joblib issue isn't yet resolved in an upcoming release, I'll reopen and mark this as a feature request.