
[urgent] HipBuildImpl() seems to fail to create temp dir, or fails because no GPU is present in the build env

Open junliume opened this issue 1 year ago • 6 comments

build_fail.log

Actual Result:

[2024-05-10T20:01:46.194Z] /src/hip/hip_build_utils.cpp:186:42: error: expected ')'
[2024-05-11T09:45:27.064Z]   186 |             MIOPEN_THROW("Failed cmd: '" MIOPEN_HIP_COMPILER "', args: '" + args + '\'');
[2024-05-11T09:45:27.064Z]       |                                          ^
[2024-05-11T09:45:27.064Z] /build_hip/include/miopen/config.h:98:29: note: expanded from macro 'MIOPEN_HIP_COMPILER'
[2024-05-11T09:45:27.064Z]    98 | #define MIOPEN_HIP_COMPILER getHIPCompilerPath()
[2024-05-11T09:45:27.064Z]       |                             ^
[2024-05-11T09:45:27.064Z] /src/hip/hip_build_utils.cpp:186:13: note: to match this '('
[2024-05-11T09:45:27.064Z]   186 |             MIOPEN_THROW("Failed cmd: '" MIOPEN_HIP_COMPILER "', args: '" + args + '\'');
[2024-05-11T09:45:27.064Z]       |             ^
[2024-05-11T09:45:27.064Z] /src/include/miopen/errors.hpp:69:28: note: expanded from macro 'MIOPEN_THROW'
[2024-05-11T09:45:27.064Z]    69 |         miopen::MIOpenThrow(__FILE__, __LINE__, __VA_ARGS__); \

It looks like this check

        if(!fs::exists(bin_file))
            MIOPEN_THROW("Failed cmd: '" MIOPEN_HIP_COMPILER "', args: '" + args + '\'');

has failed.

junliume avatar May 11 '24 22:05 junliume

@atamazov @apwojcik could you help take a look? I know it is not easily reproducible, and I am requesting an exact reproduction environment now. Could you check statically what the potential issue might be? Is it a permission problem when creating the temp dir?

junliume avatar May 11 '24 22:05 junliume

@apwojcik @JehandadKhan @atamazov another theory: the line https://github.com/ROCm/MIOpen/blob/b99493b9d8517312c84434961e6bb4621907ec59/src/hip/hip_build_utils.cpp#L186 has mismatched quote marks. This is not triggered in normal situations, but on a node where no GPU is present it hits the MIOPEN_THROW and thus produces this error. In other words, this throw is unfortunately not properly tested.

How about changing it to:

MIOPEN_THROW("Failed cmd: '" + MIOPEN_HIP_COMPILER + "', args: '" + args + '\'');
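
To make the theory concrete, here is a minimal sketch of why the original line breaks once the macro expands to a function call, and why the `+` form compiles in that case. The helper body, the path, and the `COMPILER_AS_*` macro names are hypothetical stand-ins for illustration; only the `getHIPCompilerPath()` name comes from the log above, and its real return type in MIOpen may differ.

```cpp
#include <string>

// Hypothetical stand-in for the real helper; assumed here to return std::string.
inline std::string getHIPCompilerPath() { return "/opt/rocm/llvm/bin/clang++"; }

// Case 1: the macro is a string literal (path is made up for this sketch).
#define COMPILER_AS_LITERAL "/opt/rocm/llvm/bin/clang++"
// Case 2: the macro is a function call, like the config.h line in the log.
#define COMPILER_AS_CALL getHIPCompilerPath()

std::string message_literal(const std::string& args)
{
    // After preprocessing, three literals sit next to each other and the
    // compiler merges them into one before operator+ is applied.
    return "Failed cmd: '" COMPILER_AS_LITERAL "', args: '" + args + '\'';
}

std::string message_call(const std::string& args)
{
    // Writing  "Failed cmd: '" COMPILER_AS_CALL "', args: '"  here would be a
    // literal followed by an identifier, i.e. the "expected ')'" syntax error.
    // With explicit operator+ the std::string result carries the concatenation.
    return "Failed cmd: '" + COMPILER_AS_CALL + "', args: '" + args + '\'';
}
```

Both functions build the same message in this sketch. Note that the `+` form only works because the stand-in helper returns a `std::string`.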

junliume avatar May 12 '24 18:05 junliume

@junliume

MIOPEN_THROW("Failed cmd: '" + MIOPEN_HIP_COMPILER + "', args: '" + args + '\'');

Almost like that, pls see https://github.com/ROCm/MIOpen/pull/2959#pullrequestreview-2051445141
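
One plausible reason the plain `+` version is only "almost" right (this is an assumption, the linked review has the actual details): if the macro is a string literal in some configuration, `"..." + "..."` adds two pointers and does not compile. A sketch of a form that should compile either way is below. The `USE_RUNTIME_COMPILER_PATH` switch, the `HIP_COMPILER` macro, the path, and the helper body are all hypothetical and not taken from MIOpen.

```cpp
#include <string>

// Hypothetical helper standing in for MIOpen's real function.
inline std::string getHIPCompilerPath() { return "/opt/rocm/llvm/bin/clang++"; }

#ifdef USE_RUNTIME_COMPILER_PATH // hypothetical switch for this sketch
#define HIP_COMPILER getHIPCompilerPath()
#else
#define HIP_COMPILER "/opt/rocm/llvm/bin/clang++"
#endif

std::string failed_cmd_message(const std::string& args)
{
    // Starting from an explicit std::string makes every operator+ a string
    // concatenation, whether HIP_COMPILER is a literal or a function call.
    return std::string("Failed cmd: '") + HIP_COMPILER + "', args: '" + args + '\'';
}
```

Building the same file once with and once without `-DUSE_RUNTIME_COMPILER_PATH` would exercise both expansions, which is the kind of coverage the throw was missing.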

atamazov avatar May 12 '24 19:05 atamazov

Thanks @atamazov, especially since it's after hours :)

I am still puzzled why this is happening only now. Staging has likely been building MIOpen on nodes without a GPU for a while, but we only recently started to see such issues. So maybe nogpu backend should be fixed somehow? :)

junliume avatar May 12 '24 20:05 junliume

@junliume In the attached logfile I see:

[2024-05-10T20:00:55.207Z] + cmake '-DCMAKE_PREFIX_PATH=/opt/rocm-6.2.0-488/llvm;/opt/rocm-6.2.0-488' '-DCMAKE_SHARED_LINKER_FLAGS_INIT=-Wl,--enable-new-dtags,--build-id=sha1,--rpath,$ORIGIN' '-DCMAKE_EXE_LINKER_FLAGS_INIT=-Wl,--enable-new-dtags,--build-id=sha1,--rpath,$ORIGIN/../lib' -DCMAKE_VERBOSE_MAKEFILE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=FALSE -DCMAKE_INSTALL_PREFIX=/opt/rocm-6.2.0-488 -DCMAKE_PACKAGING_INSTALL_PREFIX=/opt/rocm-6.2.0-488 -DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF -DROCM_SYMLINK_LIBS=OFF -DCPACK_PACKAGING_INSTALL_PREFIX=/opt/rocm-6.2.0-488 -DROCM_DISABLE_LDCONFIG=ON -DROCM_PATH=/opt/rocm-6.2.0-488 -DCPACK_DEBIAN_DEBUGINFO_PACKAGE=FALSE -DCPACK_RPM_DEBUGINFO_PACKAGE=FALSE -DCPACK_RPM_INSTALL_WITH_EXEC=FALSE -DCMAKE_BUILD_TYPE=Release -DMIOPEN_BACKEND=HIP -DMIOPEN_OFFLINE_COMPILER_PATHS_V2=1 -DCMAKE_CXX_COMPILER=/opt/rocm-6.2.0-488/llvm/bin/clang++ -DCMAKE_C_COMPILER=/opt/rocm-6.2.0-488/llvm/bin/clang '-DCMAKE_PREFIX_PATH=/opt/rocm-6.2.0-488;/opt/rocm-6.2.0-488/hip;/long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/deps' -DHIP_OC_COMPILER=/opt/rocm-6.2.0-488/bin/clang-ocl /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen

I have no idea where all these options come from and why. Among them is -DMIOPEN_OFFLINE_COMPILER_PATHS_V2=1, which is an indirect source of the error.

So maybe nogpu backend should be fixed somehow? :)

Let's enable MIOPEN_OFFLINE_COMPILER_PATHS_V2 by default and see ;)

atamazov avatar May 13 '24 15:05 atamazov

@atamazov Yes! With -DMIOPEN_OFFLINE_COMPILER_PATHS_V2=1 I can finally reproduce this issue (I should have checked the cmake options more carefully).

The option was added in https://github.com/ROCm/MIOpen/pull/2694; however, it was never treated as a default configuration, so the problem went unnoticed until now.

BTW, with #2959 it seems that we can build successfully even with this option enabled.

junliume avatar May 13 '24 16:05 junliume