hipBLASLt

[Issue]: Build fails with ChildProcessError: [Errno 10] No child processes and error: not a valid operand. v_cvt_f32_bf8 v14, v43 op_sel:[0,0]

Open RandUser123sa opened this issue 9 months ago • 3 comments

Problem Description

A few moments after starting the build of hipBLASLt release 6.4.0, I receive this error:

Waitcnt0Disabled 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 WrokGroupIdFromTTM 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1

Found hipcc version 6.4.43482-9999

LogicFilter: /mnt/arch/rocm/release/hipBLASLt-rocm-6.4.0/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/**/*.yaml

Experimental: False

Loading Logics...: Launching 16 threads...
Loading Logics...: Done. (0.1 secs elapsed)
Exception ignored in: <function ResourceTracker.__del__ at 0x7fb53c754400>
Traceback (most recent call last):
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 77, in __del__
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 86, in _stop
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 111, in _stop_locked
ChildProcessError: [Errno 10] No child processes
/mnt/arch/rocm/rocm-build/build/hipBLASLt/virtualenv/lib64/python3.12/site-packages/joblib/externals/loky/process_executor.py:752: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  warnings.warn(
Exception ignored in: <function ResourceTracker.__del__ at 0x7fc143e9c400>
Traceback (most recent call last):
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 77, in __del__
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 86, in _stop
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 111, in _stop_locked
ChildProcessError: [Errno 10] No child processes

and a few hours later:

/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Alik_Bjlk_F8B8B8S_BH_HAS_UserArgs_MT32x32x3vICb4pM2inKKLkrRvXgYJZ2qBKqDfjovYn_NDfZDZos=.s:3638:24: error: not a valid operand. v_cvt_f32_bf8 v14, v43 op_sel:[0,0]

and after four hours:

Command '['/opt/rocm/bin/amdclang++', '-x', 'assembler', '-target', 'amdgcn-amd-amdhsa', '-mcode-object-version=4', '-mcpu=gfx1200', '-mno-wavefrontsize64', '-c', '-o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Ailk_Bljk_B8F8B8S_BH_BiasSHB_HAS_SAB_SCD_SAk8PKFY_UaX9JsginOxrU3oENm8BTisjLZrhVPbly2SM=.o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Ailk_Bljk_B8F8B8S_BH_BiasSHB_HAS_SAB_SCD_SAk8PKFY_UaX9JsginOxrU3oENm8BTisjLZrhVPbly2SM=.s']' returned non-zero exit status 1.

Command '['/opt/rocm/bin/amdclang++', '-x', 'assembler', '-target', 'amdgcn-amd-amdhsa', '-mcode-object-version=4', '-mcpu=gfx1200', '-mno-wavefrontsize64', '-c', '-o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Ailk_Bjlk_F8B8F8S_BH_BiasSHB_HAS_SAB_SCD_SA5TXEnbsl9BvO-KyRK3Pit_h5VLjtTVIwYrQOCk3XPe0=.o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Ailk_Bjlk_F8B8F8S_BH_BiasSHB_HAS_SAB_SCD_SA5TXEnbsl9BvO-KyRK3Pit_h5VLjtTVIwYrQOCk3XPe0=.s']' returned non-zero exit status 1.

Command '['/opt/rocm/bin/amdclang++', '-x', 'assembler', '-target', 'amdgcn-amd-amdhsa', '-mcode-object-version=4', '-mcpu=gfx1200', '-mno-wavefrontsize64', '-c', '-o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Alik_Bjlk_B8F8B8S_BH_BiasSHB_HAS_SAB_SCD_SAzKT68Z_w_D2uOsuuowFsg1-OCoVnb_zX5tRn9NTYq6Q=.o', '/mnt/arch/rocm/rocm-build/build/hipBLASLt/library/build_tmp/TENSILE/assembly/Cijk_Alik_Bjlk_B8F8B8S_BH_BiasSHB_HAS_SAB_SCD_SAzKT68Z_w_D2uOsuuowFsg1-OCoVnb_zX5tRn9NTYq6Q=.s']' returned non-zero exit status 1.
Exception ignored in: <function ResourceTracker.__del__ at 0x7fcce4bf4400>
Traceback (most recent call last):
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 77, in __del__
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 86, in _stop
  File "/usr/lib64/python3.12/multiprocessing/resource_tracker.py", line 111, in _stop_locked
ChildProcessError: [Errno 10] No child processes

Currently, the compilation has been running for 9 hours and the processes still have not finished:

2248 tty1 Dl+ 147:03 ../virtualenv/bin/python3.12 /mnt/arch/rocm/rocm-build/build/hipBLASLt/virtualenv/lib64/python3.12/site-packages/Tensile/bin/TensileCreateLibrary --merge-files --separate-architectures --lazy-library-loading --no-short-file-names --no-library-print-debug --code-object-version=default --cxx-compiler=amdclang++ --library-format=msgpack --architecture=gfx900_gfx90a_gfx942_gfx1030_gfx1100_gfx1101_gfx1102_gfx1200_gfx1201 --build-id=sha1 /mnt/arch/rocm/release/hipBLASLt-rocm-6.4.0/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full /mnt/arch/rocm/rocm-build/build/hipBLASLt/Tensile HIP
11941 tty1 S+ 0:00 /mnt/arch/rocm/rocm-build/build/hipBLASLt/library/../virtualenv/bin/python3.12 -c from joblib.externals.loky.backend.resource_tracker import main; main(3, False)
11942 tty1 S+ 0:00 /mnt/arch/rocm/rocm-build/build/hipBLASLt/library/../virtualenv/bin/python3.12 -c from multiprocessing.resource_tracker import main;main(10)

Tensile::FATAL: ** kernel compilation failure **
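
For anyone triaging a similar log, here is a minimal Python sketch that pulls the failed assembler invocations out of the build output and groups them by `-mcpu` target. It assumes the failures are printed in the `Command '[...]' returned non-zero exit status N` form shown above. The function names are made up for this example and are not part of Tensile or hipBLASLt.

```python
import ast
import re
from collections import Counter

# Matches the "Command '[...]' returned non-zero exit status N" text from the log.
FAILED_CMD = re.compile(r"Command '(\[.*?\])' returned non-zero exit status (\d+)")


def failed_assembler_commands(log_text):
    """Yield (mcpu, source_file, exit_status) for each failed command in the log."""
    for match in FAILED_CMD.finditer(log_text):
        # The argument list is printed as a Python list literal, so parse it safely.
        argv = ast.literal_eval(match.group(1))
        mcpu = next(
            (arg.split("=", 1)[1] for arg in argv if arg.startswith("-mcpu=")),
            None,
        )
        # In the commands above, the last argument is the .s file being assembled.
        yield mcpu, argv[-1], int(match.group(2))


def failures_per_target(log_text):
    """Count failed commands per -mcpu value (hypothetical helper)."""
    return Counter(mcpu for mcpu, _, _ in failed_assembler_commands(log_text))
```

Running `failures_per_target(open("make-hipBLASLt.log").read())` on the attached make log would show whether the failures are confined to a single target. All three failed commands quoted above use `-mcpu=gfx1200`.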

Operating System

Slackware 15.0

CPU

AMD Ryzen 7 3800X 8-Core Processor

GPU

AMD Radeon RX 7900 XTX

Other

No response

ROCm Version

ROCm 6.2.3

ROCm Component

hipBLASLt

Steps to Reproduce

Here are the configure and make logs:

configure-hipBLASLt.log

make-hipBLASLt.log

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

RandUser123sa, Apr 16 '25 07:04

Hi @RandUser123sa. An internal ticket has been created to investigate this issue. Thanks!

ppanchad-amd, Apr 16 '25 15:04

Hi,

I was able to compile the project. My rocTracer was at version 6.3.1, so I first upgraded it to version 6.4.0. I then added the parameter -D Tensile_ENABLE_MARKER=ON to my configure script and reran the configure and build of hipBLASLt. I don't know whether either of these was the cause, but the error not a valid operand. v_cvt_f32_bf8 v14 has disappeared. The warning ChildProcessError: [Errno 10] No child processes still appears, but it is not a problem because the compilation continues.

You can close the issue if you think it is resolved.

RandUser123sa, Apr 18 '25 05:04

Hi @RandUser123sa, I was not able to reproduce the error with the install script. Does ./install.sh -dc work for you? You can check your cmake flags against our CI workflow. I'd also make sure that you are not running out of resources during the build process.
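
One rough way to check the resource point before starting a long build is to compare available memory with the number of parallel workers. The sketch below reads `/proc/meminfo` on Linux and suggests a worker count. The 2 GiB per worker figure is only an assumed placeholder, not a measured or documented requirement of the hipBLASLt build, and the function names are hypothetical.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (values in KiB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def suggest_jobs(mem_available_kib, cpus, gib_per_job=2.0):
    """Return a worker count bounded by both CPU count and available memory.

    gib_per_job is an assumption for illustration; tune it for your build.
    """
    by_memory = int(mem_available_kib / (gib_per_job * 1024 * 1024))
    return max(1, min(cpus, by_memory))


def current_suggestion(gib_per_job=2.0):
    """Read the live system state (Linux only) and suggest a worker count."""
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    return suggest_jobs(info["MemAvailable"], os.cpu_count() or 1, gib_per_job)
```

If `current_suggestion()` returns far fewer workers than the 16 threads seen in the log, memory pressure is a plausible explanation for the "A worker stopped while some jobs were given to the executor" warning, though that is not confirmed here.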

zichguan-amd, Apr 23 '25 14:04