rocprofiler
Infinite recursion in librocprofiler.so
Hi,
I am a developer from the HPCToolkit project at Rice University. I am developing AMD GPU counter support in HPCToolkit using the rocprofiler API directly. I am currently running into an infinite recursion in librocprofiler.so with the following stack trace:
#0 0x00007f38ba752e76 in d_print_comp_inner () from /lib64/libstdc++.so.6
#1 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#2 0x00007f38ba753c57 in d_print_comp_inner () from /lib64/libstdc++.so.6
#3 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#4 0x00007f38ba75584d in d_print_comp_inner () from /lib64/libstdc++.so.6
#5 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#6 0x00007f38ba75584d in d_print_comp_inner () from /lib64/libstdc++.so.6
#7 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#8 0x00007f38ba75584d in d_print_comp_inner () from /lib64/libstdc++.so.6
#9 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#10 0x00007f38ba757d02 in d_print_function_type.isra () from /lib64/libstdc++.so.6
#11 0x00007f38ba75474a in d_print_comp_inner () from /lib64/libstdc++.so.6
#12 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#13 0x00007f38ba754a42 in d_print_comp_inner () from /lib64/libstdc++.so.6
#14 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#15 0x00007f38ba75387d in d_print_comp_inner () from /lib64/libstdc++.so.6
#16 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#17 0x00007f38ba75907f in d_demangle_callback.constprop () from /lib64/libstdc++.so.6
#18 0x00007f38ba759361 in __cxa_demangle () from /lib64/libstdc++.so.6
#19 0x00007f38b0b51dd5 in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#20 0x00007f38b8cbfc6d in rocr::amd::hsa::loader::ExecutableImpl::IterateSymbols(hsa_status_t (*)(hsa_executable_s, hsa_executable_symbol_s, void*), void*) ()
from /opt/rocm-4.3.1/lib/libhsa-runtime64.so.1
#21 0x00007f38b8c9b853 in rocr::HSA::hsa_executable_iterate_symbols(hsa_executable_s, hsa_status_t (*)(hsa_executable_s, hsa_executable_symbol_s, void*), void*) ()
from /opt/rocm-4.3.1/lib/libhsa-runtime64.so.1
#22 0x00007f38b0b4fa4a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#23 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#24 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#25 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#26 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#27 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#28 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#29 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#30 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#31 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#32 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#33 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#34 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#35 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#36 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#37 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#38 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#39 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
As you can see, this is with rocm-4.3.1. With that version, I was able to work around the issue by setting settings->code_obj_tracking = 0; inside OnLoadToolProp:
https://github.com/HPCToolkit/hpctoolkit/blob/rocprofiler_support/src/tool/hpcrun/gpu/amd/rocprofiler-api.c#L436
This workaround does not seem to work with rocm-4.5.0, where I see a similar infinite recursion in librocprofiler.
Are there any recommendations or insights on resolving this problem?
I recompiled rocprofiler from source for both rocm-4.3.1 and rocm-4.5.2 to better understand the problem. On the surface, the infinite recursion happens because the interceptor function for the code object freeze operation calls itself (https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.3.x/src/util/hsa_rsrc_factory.cpp#L779).
Previously, with rocm-4.3.1, I was able to work around this problem by disabling code object tracking. This workaround no longer works in rocm-4.5.2 because code object tracking is always enabled (https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.5.2/src/core/rocprofiler.cpp#L429). A user can still set the field that disables code object tracking, but the setting is ignored. It would be really helpful to document important internal changes like this.
Now back to the real problem: why does the infinite recursion happen?
hsa_status_t HsaRsrcFactory::hsa_executable_freeze_interceptor(hsa_executable_t executable, const char *options) {
  std::lock_guard<mutex_t> lck(mutex_);
  if (symbols_map_ == NULL) symbols_map_ = new symbols_map_t;
  hsa_status_t status = hsa_api_.hsa_executable_iterate_symbols(executable, executable_symbols_cb, NULL);
  CHECK_STATUS("Error in iterating executable symbols", status);
  return hsa_api_.hsa_executable_freeze(executable, options);
}
The function pointer in the return statement ends up being HsaRsrcFactory::hsa_executable_freeze_interceptor itself, which causes the recursion. hsa_api_ is set in the function HsaRsrcFactory::InitHsaApiTable (https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.3.x/src/util/hsa_rsrc_factory.cpp#L184). Based on my understanding, hsa_api_ records a set of actual HSA calls, so that the interceptors can be interposed upon HSA calls.
HsaRsrcFactory::InitHsaApiTable is called in two places:
- https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.3.x/src/util/hsa_rsrc_factory.cpp#L127
- https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.3.x/src/core/rocprofiler.cpp#L114
When I trace in gdb, the following events happen:
- InitHsaApiTable is called at the first call site with the input parameter NULL. This leads to hsa_api_ being initialized with function pointers to the HSA API entries.
- InitHsaApiTable is called at the second call site with an input parameter that represents the actual implementation of the HSA APIs. This call does not update hsa_api_ because of the first if statement at the beginning of InitHsaApiTable.
The fundamental issue here is that the HSA API entries are just wrapper functions around the actual implementation functions. For example, the API entry for hsa_executable_freeze is shown below (I could not find the HSA runtime source code, so I disassembled the shared library):
0000000000067940 <hsa_executable_freeze>:
67940: 48 8b 05 89 2a 42 00 mov 0x422a89(%rip),%rax # 48a3d0 <_ZL12coreApiTable>
67947: ff a0 c0 02 00 00 jmpq *0x2c0(%rax)
It is clear to me that this function just reads a function table and then does a tail call to the actual implementation.
Now the problem is that if hsa_api_ points to this public entry function, which is just a wrapper, and rocprofiler later updates the actual coreApiTable with its interceptor, we end up with the interceptor calling itself.
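The mechanism described above can be modelled in a few lines without the HSA runtime. This is only a sketch: every name in it (`CoreApiTable`, `public_freeze`, `freeze_interceptor`, `g_saved`) is a hypothetical stand-in, and a depth guard is added so the demonstration terminates instead of overflowing the stack.

```cpp
#include <cassert>

// Hypothetical stand-ins; none of these names come from the HSA runtime.
using freeze_fn_t = int (*)(int executable);

struct CoreApiTable {
  freeze_fn_t freeze_fn;
};

// Plays the role of the actual implementation function.
static int real_freeze(int executable) { return executable; }

static CoreApiTable g_core_table = {real_freeze};

// Public entry point: like the disassembled wrapper, it only reads the
// table slot and forwards the call to whatever is stored there.
int public_freeze(int executable) { return g_core_table.freeze_fn(executable); }

static freeze_fn_t g_saved = nullptr;  // plays the role of hsa_api_
static int g_depth = 0;                // number of interceptor invocations
static const int kMaxDepth = 8;        // guard so the demo terminates

static int freeze_interceptor(int executable) {
  if (++g_depth >= kMaxDepth) return -1;  // would recurse forever without this
  return g_saved(executable);             // intended to reach the real implementation
}

// Returns how many times the interceptor runs for a single public call.
int interceptor_calls(bool save_public_wrapper) {
  g_core_table.freeze_fn = real_freeze;
  g_depth = 0;
  // Save either the public wrapper (the problematic case) or the table entry.
  g_saved = save_public_wrapper ? &public_freeze : g_core_table.freeze_fn;
  // The tool then installs its interceptor into the table.
  g_core_table.freeze_fn = freeze_interceptor;
  public_freeze(42);
  return g_depth;
}
```

In this model, saving the wrapper makes the interceptor re-enter itself until the guard trips, while saving the table entry before installing the interceptor yields exactly one interceptor call.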
To resolve this problem, it looks to me like we should just remove the first if statement in InitHsaApiTable (https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.3.x/src/util/hsa_rsrc_factory.cpp#L187). This ensures that rocprofiler picks up the actual HSA implementation functions provided by the HSA runtime, while still being able to call HSA functions when the HSA implementation table is not ready. With this change, I can resolve the infinite recursion problem, at least locally.
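A minimal model of that initialization order, assuming the guard behaves as a "skip if already set" check. The names `init_saved_api`, `SavedApi`, `ApiTable`, and the `keep_guard` switch are hypothetical and exist only for illustration; this is not the rocprofiler source.

```cpp
#include <cassert>

using freeze_fn_t = int (*)(int);

// Hypothetical model of the runtime's implementation table.
struct ApiTable {
  freeze_fn_t freeze_fn;
};

static int impl_freeze(int x) { return x + 1; }  // "actual implementation"
static ApiTable g_runtime_table = {impl_freeze};

// Public entry point that merely forwards through the table.
int entry_freeze(int x) { return g_runtime_table.freeze_fn(x); }

// Hypothetical model of the tool's saved function pointers.
struct SavedApi {
  freeze_fn_t freeze = nullptr;
};

// keep_guard == true models the "already initialized, do nothing" check;
// keep_guard == false models the proposal of removing that check.
void init_saved_api(SavedApi& saved, const ApiTable* table, bool keep_guard) {
  if (keep_guard && saved.freeze != nullptr) return;
  saved.freeze = (table != nullptr) ? table->freeze_fn : &entry_freeze;
}

// Replays the two call sites in the order observed in gdb.
bool saved_points_to_impl(bool keep_guard) {
  SavedApi saved;
  init_saved_api(saved, nullptr, keep_guard);           // first call: no table yet
  init_saved_api(saved, &g_runtime_table, keep_guard);  // second call: real table
  return saved.freeze == &impl_freeze;
}
```

With the guard in place, the second call is ignored and the saved pointer stays on the public wrapper. Without it, the second call overwrites the pointer with the implementation function.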
While I would like to make a PR for this, I find that the rocprofiler GitHub repo is in quite a strange state: the rocm-4.5.2 tag is shown to contain commits that are not in the repo, and both the amd-master branch and the rocm-4.5.2 branch are behind the rocm-4.5.2 tag. Against which branch should I make the PR?
A similar infinite recursion showed up when using the code object URI callback in roctracer (https://github.com/ROCm-Developer-Tools/roctracer/blob/amd-master/test/app/codeobj_test.cpp#L60). After some investigation, I posted a PR against the rocm-4.5.x branch for both instances of infinite recursion (#70).
Could you please create a small reproducer and share the command line for running the test? By default, table->core->hsa_executable_freeze_fn is equal to rocr::HSA::hsa_executable_freeze. This means that checking the table for NULL is obsolete in that case.
@kikimych I have uploaded a reproducer: rocprofiler-test.tar.gz
Hi @mxz297,
I can reproduce this on current rocprof with your example code; thanks. I'll see if your patch in #70 fixes it (thank you for the PR too!), otherwise I'll have to keep looking into it.
@mxz297 Can you please check whether your issue still exists in the latest ROCm 6.2? If it is resolved, please close the ticket. Thanks!
Hi @mxz297, thanks for your patience.
I was able to reproduce your issue on the latest ROCm using the reproducer code provided. You are correct that the hsa_api_ table of functions that rocprofiler uses is not being initialized correctly. The reason is that before the rocprofiler tool can be used, the HSA runtime has to be initialized, and it loads rocprofiler during that initialization. However, HSA calls pthread_create during initialization, and you have modified pthread_create to call rocprofiler_iterate_info, so we end up calling rocprofiler_iterate_info before HSA has completed initialization. This leads to hsa_api_ being initialized incorrectly and causes the infinite recursion problem.
The solution here is to make sure HSA has been completely initialized (you can do this explicitly by calling hsa_init()) before any calls to the rocprofiler API. This is tricky because, as mentioned above, HSA uses pthread_create during initialization. If you wish to keep rocprofiler API calls inside your pthread_create, you would need to guarantee that you call hsa_init before any calls to pthread_create happen (perhaps by using an __attribute__((constructor)) function to call hsa_init when libtest.so loads) and that these rocprofiler API calls are disabled until HSA has been initialized (by using a flag). I have attached the modified libtest.c code as an example fix.
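The constructor-plus-flag pattern can be sketched as follows. This is not the attached libtest.c. `fake_runtime_init` and `fake_profiler_query` are hypothetical stand-ins for hsa_init() and a rocprofiler API call, `on_thread_create` stands in for the body of an interposed pthread_create, and `__attribute__((constructor))` is a GCC/Clang extension.

```cpp
#include <atomic>
#include <cassert>

// Flag that stays false until the (stand-in) runtime has finished initializing.
static std::atomic<bool> g_runtime_ready{false};
static int g_profiler_calls = 0;

// Hypothetical stand-in for a rocprofiler API call such as the one made
// from the interposed pthread_create in the reproducer.
static void fake_profiler_query() { ++g_profiler_calls; }

// Stand-in for the body of an interposed pthread_create: profiler calls
// are skipped until the runtime reports that it is ready.
void on_thread_create() {
  if (g_runtime_ready.load(std::memory_order_acquire)) {
    fake_profiler_query();
  }
}

// Hypothetical stand-in for hsa_init(): the runtime creates a thread while
// it is still initializing, which re-enters the hook above.
static void fake_runtime_init() { on_thread_create(); }

// Runs when the library or program loads, before main().
__attribute__((constructor)) static void init_before_main() {
  fake_runtime_init();                                      // hook is gated here
  g_runtime_ready.store(true, std::memory_order_release);   // enable profiler calls
}

// Accessor so callers can observe how many profiler calls were made.
int profiler_calls() { return g_profiler_calls; }
```

In this model, the thread created during initialization makes no profiler call because the flag is still false. Thread creations after the constructor returns go through as usual.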
Also, note that OnLoadToolProp is called when HSA loads rocprofiler, so you can use that function for any rocprofiler-related initialization that you need to do in your own code.
Please let me know if that fixes your issue!
@mxz297 I'm going to close this ticket due to inactivity. If the above fix does not work, feel free to re-open the ticket and we can look into it further.
Also, I want to mention that rocprofv1 is no longer under development and there is a new rocprofv3 released in ROCm 6.2 as a beta, which is built on top of the new rocprofiler-sdk. I strongly suggest using rocprofv3 since it is very close to having feature parity, has a lower overhead than v1 and v2, and is significantly better tested. Here is a link to the documentation for rocprofv3 (See the "Using rocprofv3" section).