[Issue]: Build fails with ChildProcessError: [Errno 10] No child processes and error: not a valid operand. v_cvt_f32_bf8 v14, v43 op_sel:[0,0]
Problem Description
A few moments after starting the build of hipBLASLt release 6.4.0, I receive this error:
Waitcnt0Disabled 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 WrokGroupIdFromTTM 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
Found hipcc version 6.4.43482-9999
LogicFilter: /mnt/arch/rocm/release/hipBLASLt-rocm-6.4.0/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/**/*.yaml
Experimental: False
Loading Logics...: Launching 16 threads...
Loading Logics...: Done. (0.1 secs elapsed)
Exception ignored in: <function ResourceTracker.__del__ at 0x7fb53c754400>
Traceback (most recent call last):
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 77, in __del__
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 86, in _stop
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 111, in _stop_locked
ChildProcessError: [Errno 10] No child processes
/mnt/arch/rocm/rocm-build/build/hipBLASLt/virtualenv/lib64/python3.12/site-packages/joblib/externals/loky/process_executor.py:752: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  warnings.warn(
Exception ignored in: <function ResourceTracker.__del__ at 0x7fc143e9c400>
Traceback (most recent call last):
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 77, in __del__
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 86, in _stop
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 111, in _stop_locked
ChildProcessError: [Errno 10] No child processes
and a few hours later:
/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Alik_Bjlk_F8B8B8S_BH_HAS_UserArgs_MT32x32x3vICb4pM2inKKLkrRvXgYJZ2qBKqDfjovYn_NDfZDZos=.s:3638:24: error: not a valid operand.
v_cvt_f32_bf8 v14, v43 op_sel:[0,0]
and after four hours:
Command '['/opt/rocm/bin/amdclang++', '-x', 'assembler', '-target', 'amdgcn-amd-amdhsa', '-mcode-object-version=4', '-mcpu=gfx1200', '-mno-wavefrontsize64', '-c', '-o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Ailk_Bljk_B8F8B8S_BH_BiasSHB_HAS_SAB_SCD_SAk8PKFY_UaX9JsginOxrU3oENm8BTisjLZrhVPbly2SM=.o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Ailk_Bljk_B8F8B8S_BH_BiasSHB_HAS_SAB_SCD_SAk8PKFY_UaX9JsginOxrU3oENm8BTisjLZrhVPbly2SM=.s']' returned non-zero exit status 1.
Command '['/opt/rocm/bin/amdclang++', '-x', 'assembler', '-target', 'amdgcn-amd-amdhsa', '-mcode-object-version=4', '-mcpu=gfx1200', '-mno-wavefrontsize64', '-c', '-o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Ailk_Bjlk_F8B8F8S_BH_BiasSHB_HAS_SAB_SCD_SA5TXEnbsl9BvO-KyRK3Pit_h5VLjtTVIwYrQOCk3XPe0=.o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Ailk_Bjlk_F8B8F8S_BH_BiasSHB_HAS_SAB_SCD_SA5TXEnbsl9BvO-KyRK3Pit_h5VLjtTVIwYrQOCk3XPe0=.s']' returned non-zero exit status 1.
Command '['/opt/rocm/bin/amdclang++', '-x', 'assembler', '-target', 'amdgcn-amd-amdhsa', '-mcode-object-version=4', '-mcpu=gfx1200', '-mno-wavefrontsize64', '-c', '-o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Alik_Bjlk_B8F8B8S_BH_BiasSHB_HAS_SAB_SCD_SAzKT68Z_w_D2uOsuuowFsg1-OCoVnb_zX5tRn9NTYq6Q=.o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Alik_Bjlk_B8F8B8S_BH_BiasSHB_HAS_SAB_SCD_SAzKT68Z_w_D2uOsuuowFsg1-OCoVnb_zX5tRn9NTYq6Q=.s']' returned non-zero exit status 1.
Exception ignored in: <function ResourceTracker.__del__ at 0x7fcce4bf4400>
Traceback (most recent call last):
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 77, in __del__
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 86, in _stop
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 111, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Currently, the compilation has been running for 9 hours and the processes still have not finished.
2248 tty1 Dl+ 147:03 ../virtualenv/bin/python3.12 /mnt/arch/rocm/rocm-build/build/hipBLASLt/virtualenv/lib64/python3.12/site-packages/Tensile/bin/TensileCreateLibrary --merge-files --separate-architectures --lazy-library-loading --no-short-file-names --no-library-print-debug --code-object-version=default --cxx-compiler=amdclang++ --library-format=msgpack --architecture=gfx900_gfx90a_gfx942_gfx1030_gfx1100_gfx1101_gfx1102_gfx1200_gfx1201 --build-id=sha1 /mnt/arch/rocm/release/hipBLASLt-rocm-6.4.0/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full /mnt/arch/rocm/rocm-build/build/hipBLASLt/Tensile HIP
11941 tty1 S+ 0:00 /mnt/arch/rocm/rocm-build/build/hipBLASLt/library/../virtualenv/bin/python3.12 -c from joblib.externals.loky.backend.resource_tracker import main; main(3, False)
11942 tty1 S+ 0:00 /mnt/arch/rocm/rocm-build/build/hipBLASLt/library/../virtualenv/bin/python3.12 -c from multiprocessing.resource_tracker import main;main(10)
Tensile::FATAL: ** kernel compilation failure **
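Note: the three failed amdclang++ commands quoted above all pass -mcpu=gfx1200. When the build log is long, it can help to extract that information mechanically. The following is only a minimal sketch, assuming the log contains the failure lines in exactly the "Command '[...]' returned non-zero exit status N." form shown above; the function name failed_assemblies is hypothetical and not part of Tensile.

```python
import ast
import re

# Matches the "Command '[...]' returned non-zero exit status N" text
# in the form quoted in this report (assumed to sit on a single line).
FAILED_CMD = re.compile(r"Command '(\[.*?\])' returned non-zero exit status (\d+)")


def failed_assemblies(log_text):
    """Return a list of (mcpu, source_file, exit_status) tuples.

    Hypothetical helper: it relies on the argument list being a valid
    Python list literal and on the source file being the last argument,
    as in the commands quoted above.
    """
    results = []
    for match in FAILED_CMD.finditer(log_text):
        argv = ast.literal_eval(match.group(1))
        # Pick out the value of -mcpu=..., if present.
        mcpu = next(
            (arg.split("=", 1)[1] for arg in argv if arg.startswith("-mcpu=")),
            None,
        )
        results.append((mcpu, argv[-1], int(match.group(2))))
    return results
```

Grouping the result by its first field shows at a glance whether the failures are confined to one target architecture.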
Operating System
Slackware 15.0
CPU
AMD Ryzen 7 3800X 8-Core Processor
GPU
AMD Radeon RX 7900 XTX
Other
No response
ROCm Version
ROCm 6.2.3
ROCm Component
hipBLASLt
Steps to Reproduce
Here is the configure and make log
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
Hi @RandUser123sa. Internal ticket has been created to investigate this issue. Thanks!
Hi,
I was able to compile the project. First, my rocTracer was version 6.3.1, so I upgraded it to version 6.4.0. I also added the parameter -D Tensile_ENABLE_MARKER=ON to my configure script, then reran the configure and build of hipBLASLt. I don't know whether either change was the cause, but the error not a valid operand. v_cvt_f32_bf8 v14 disappeared. The warning ChildProcessError: [Errno 10] No child processes still appears, but it's not a problem because the compilation continues.
You can close the issue if you think it is resolved.
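Note on the remaining warning: Errno 10 is ECHILD on Linux, which the operating system returns when a process waits for a child that does not exist or has already been reaped. The sketch below only illustrates what that error code means. It is not a diagnosis of why the resource tracker in this build hits it, and the function name wait_after_reap is hypothetical.

```python
import errno
import os
import subprocess
import sys


def wait_after_reap():
    """Reap a child, wait on it a second time, and return the errno raised."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()  # the first wait reaps the child process
    try:
        os.waitpid(child.pid, 0)  # no such child is left to wait for
    except ChildProcessError as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    # On Linux this reports errno.ECHILD, i.e. "[Errno 10] No child processes".
    print(wait_after_reap() == errno.ECHILD)
```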
Hi @RandUser123sa, I was not able to reproduce the error with the install script. Does ./install.sh -dc work for you? You can check your cmake flags against our CI workflow. I'd also make sure that you are not running out of resources during the build process.
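One rough way to check for resource pressure before starting a long parallel build is to compare available memory with the number of jobs. This is a sketch under stated assumptions: it reads the Linux /proc/meminfo format, the helper names are hypothetical, and the per-job memory figure is a placeholder the caller must supply, since the thread gives no measured value for Tensile kernel compilation.

```python
def mem_available_kib(meminfo_text):
    """Extract the MemAvailable value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None  # field missing, e.g. on a non-Linux system


def max_parallel_jobs(available_kib, per_job_kib, cpu_count):
    """Cap the job count by CPU count and by available memory (at least 1)."""
    by_memory = available_kib // per_job_kib
    return max(1, min(cpu_count, by_memory))


def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory summary; the path is the usual Linux location."""
    with open(path) as handle:
        return handle.read()
```

For example, max_parallel_jobs(mem_available_kib(read_meminfo()), per_job_kib, os.cpu_count() or 1) gives an upper bound on the job count, where per_job_kib is the caller's own estimate of memory per compile job.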