[CI][Failure] shared_ptr_base.h:199:9: runtime error: member call on address which does not point to an object of type 'std::_Sp_counted_base<>'
Another byproduct of #3181
The error message:
/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:199:9: runtime error: member call on address 0x00000b9e6590 which does not point to an object of type 'std::_Sp_counted_base<>'
0x00000b9e6590: note: object has invalid vptr
00 00 00 00 d8 c0 dd 8e 53 7f 00 00 00 00 00 00 02 00 00 00 d9 01 00 00 00 00 00 00 30 d5 ac 10
^~~~~~~~~~~~~~~~~~~~~~~
invalid vptr
#0 0x7f53852a1bc7 (/data/MIOpen/build/lib/libMIOpen.so.1+0x29a5ebc7)
#1 0x7f538e2c87cb (/data/MIOpen/build/lib/libMIOpen.so.1+0x32a857cb)
#2 0x7f530f9c2d9e (/lib/x86_64-linux-gnu/libc.so.6+0x45d9e) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
#3 0x7f530f9c25c8 (/lib/x86_64-linux-gnu/libc.so.6+0x455c8) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
#4 0x7f530f9c260f (/lib/x86_64-linux-gnu/libc.so.6+0x4560f) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
#5 0x7f530f9a6d96 (/lib/x86_64-linux-gnu/libc.so.6+0x29d96) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
#6 0x7f530f9a6e3f (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
#7 0x249004 (/data/MIOpen/build/bin/test_find_db+0x249004)
[How to reproduce]:
cmake command:
CXX=/opt/rocm/llvm/bin/clang++ CXXFLAGS='-Werror' cmake -DMIOPEN_TEST_FLAGS=' --disable-verification-cache ' -DCMAKE_BUILD_TYPE=debug -DCMAKE_CXX_FLAGS_DEBUG='-g -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined -Wno-option-ignored ' -DBUILD_DEV=Off -DMIOPEN_USE_MLIR=ON -DMIOPEN_GPU_SYNC=Off -DCMAKE_PREFIX_PATH=/opt/rocm ..
and then
LLVM_PATH=/opt/rocm/llvm CTEST_PARALLEL_LEVEL=4 make -j$(nproc) install check MIOpenDriver
@BrianHarrisonAMD @atamazov I suspect -fsanitize=undefined but need more investigation.
It must be one of these:
-DCMAKE_CXX_FLAGS_DEBUG='-g -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined -Wno-option-ignored '
Update: confirmed it is due to -fsanitize=undefined
@junliume @amberhassaan @DrizztDoUrden AFAICS, UB is related to hipFree. I recommend checking if reverting #2524 resolves the issue.
Unfortunately, in my short experiment reverting #2524 does not resolve this issue.
We do see lots of warning messages like:
Warning [hip_mem_get_info_wrapper] hipMemGetInfo error, status: 1
@junliume
IIRC sometimes we need to know the amount of free GPU memory and use hipMemGetInfo to query this info. But in some cases, this HIP function does fail, and I have no idea why. The workaround (which issues a warning and simply returns some fixed value) was introduced in #2333, https://github.com/ROCm/MIOpen/pull/2333/commits/6477e68cc0c98655700ec6dd83b1da7f608231c4
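For readers who have not looked at #2333, the "warn and return a fixed value" pattern described above can be sketched roughly as follows. This is only an illustration, not the actual MIOpen code: `MemInfoQuery`, `QueryMemInfoOrFallback`, and the 1 GiB `kFallbackBytes` value are hypothetical names and numbers, and the query callable merely stands in for a runtime call such as hipMemGetInfo.

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>

// Hypothetical stand-in for a runtime query such as hipMemGetInfo:
// returns 0 on success and fills in the free/total byte counts.
using MemInfoQuery = std::function<int(std::size_t* free_bytes, std::size_t* total_bytes)>;

struct MemInfo
{
    std::size_t free_bytes;
    std::size_t total_bytes;
    bool is_fallback; // true when the values are the fixed defaults
};

// Assumed fallback value, chosen only for illustration.
constexpr std::size_t kFallbackBytes = std::size_t{1} << 30; // 1 GiB

// Calls the query; on a non-zero status, prints a warning and returns
// the fixed fallback instead of propagating the error.
MemInfo QueryMemInfoOrFallback(const MemInfoQuery& query)
{
    std::size_t free_bytes  = 0;
    std::size_t total_bytes = 0;
    const int status        = query(&free_bytes, &total_bytes);
    if(status != 0)
    {
        std::fprintf(stderr, "Warning [QueryMemInfoOrFallback] query error, status: %d\n", status);
        return {kFallbackBytes, kFallbackBytes, true};
    }
    return {free_bytes, total_bytes, false};
}
```

The `is_fallback` flag is an addition of this sketch so that callers can tell real values from defaults; whether the real wrapper exposes anything like it is not something this thread confirms.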
I suspect that the reason for the HIP runtime failure is a combination of a severely outdated base driver + new ROCm in Docker + some target ASICs. I think that we need some assistance from the HIP runtime team.
@junliume ...but I do not think this is related to this specific issue with UB.
Not sure if this was already known, but I tracked it down to the test_find_db testsuite, and it appears to be from calling the following in solver_finders.cpp:
std::transform(
    finders.begin(), finders.end(), std::inserter(solutions, solutions.end()), [&](auto&& f) {
        return std::make_pair(f->GetAlgorithmName(problem),
                              f->Find(ctx, problem, invoke_ctx, parameters, options));
    });
It seems that calling Find on the finders causes this issue in the test.
Update: for the find_db.cpp test, I changed it to run only the forward test, and narrowed the issue down to miopen::solver::conv::ConvMlirIgemmFwdXdlops for me.
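The narrowing itself is just "disable one candidate at a time and rerun". A rough, self-contained sketch of that idea is below. Everything in it is hypothetical: `FakeFinder` and `RunFindersExcept` are stand-ins invented for illustration and are not MIOpen types, and the `int` result only replaces whatever a real Find call returns.

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a finder/solver: a name plus a callable
// that produces some result (an int here, purely for illustration).
struct FakeFinder
{
    std::string name;
    std::function<int()> find;
};

// Runs every finder whose name is not in `excluded` and collects the
// results by name. Rerunning with different exclusion sets shows which
// candidate has to be skipped for the failure to disappear.
std::map<std::string, int> RunFindersExcept(const std::vector<FakeFinder>& finders,
                                            const std::set<std::string>& excluded)
{
    std::map<std::string, int> solutions;
    for(const auto& f : finders)
    {
        if(excluded.count(f.name) != 0)
            continue; // skipped: this finder's code never runs
        solutions.emplace(f.name, f.find());
    }
    return solutions;
}
```

Excluding by name at runtime is only a convenience of the sketch; in the actual investigation the solver was disabled by commenting it out of the solver container.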
This change to mlo_dir_conv.cpp fixes the forward test for me:
static auto GetImplicitGemmSolvers()
{
    return miopen::solver::SolverContainer<
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R5Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R4Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R4Xdlops_Padded_Gemm,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV4R1Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV1R1Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmV4R1Fwd,
        miopen::solver::conv::ConvHipImplicitGemmV4R4Fwd,
        // miopen::solver::conv::ConvMlirIgemmFwdXdlops,
        miopen::solver::conv::ConvMlirIgemmFwd,
        miopen::solver::conv::ConvMlirIgemmBwdXdlops,
        miopen::solver::conv::ConvMlirIgemmBwd,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV1R1,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV4R1,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicFwd_1x1,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicFwd,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicBwd,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdXdlops,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicBwdXdlops,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdXdlopsNHWC,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicBwdXdlopsNHWC,
        miopen::solver::conv::ConvCkIgemmFwdV6r1DlopsNchw,
#if MIOPEN_BACKEND_HIP && MIOPEN_USE_COMPOSABLEKERNEL
        miopen::solver::conv::ConvHipImplicitGemmFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmGroupFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmGroupBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemm3DGroupFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemm3DGroupBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmF16F8F16FwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmF16F8F16BwdXdlops,
#endif // MIOPEN_BACKEND_HIP && MIOPEN_USE_COMPOSABLEKERNEL
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdDlopsNCHWC>{};
}
Going to dig a bit deeper to see what's the issue with that one solver.
Edit: Looks like the issue happens for me if I call any of the miir APIs, and goes away if I prevent those calls from happening.
This line is enough for it to trigger the issue for me:
miirCreateHandle(params.c_str());
Looks like it's due to the params the handle is created with, but I'm not sure yet why this has become an issue now.
Adding a branch to suppress the UBSan errors, since they come from MLIR handle creation and our options are limited because we are using an older MLIR version.
PR with the suppression changes is up: #3198
@BrianHarrisonAMD , @junliume : Do we know what causes the error? It can't be that shared_ptr_base.h is the culprit. Could we be ignoring some problem in our code by suppressing these errors?
@amberhassaan shared_ptr_base.h isn't the culprit; it's just where the UBSan error surfaces during teardown of the application, and suppressing it there was the only way I could find to silence the error. The issue can be narrowed down to just creating an MLIR handle with nothing else happening (I made a reproducer for that), and it appears to be caused by something in MLIR cleaning up static memory during exit.
This issue has been migrated to: https://github.com/ROCm/rocm-libraries/issues/874