ROCT-Thunk-Interface
ROCT-Thunk-Interface copied to clipboard
Missing symbols in 5.7.0 release
Hi! I'm packaging ROCm for Arch Linux. The new release of hsa-rocr now calls functions like hsaKmtWaitOnEvent_Ext or hsaKmtReplaceAsanHeaderPage. However, these functions are not part of version 5.7.0: I couldn't find matching declarations or definitions in the source code.
Example output when building rocminfo:
/usr/bin/ld: /opt/rocm/lib/libhsa-runtime64.so.1.11.0: undefined reference to `hsaKmtWaitOnMultipleEvents_Ext'
/usr/bin/ld: /opt/rocm/lib/libhsa-runtime64.so.1.11.0: undefined reference to `hsaKmtReplaceAsanHeaderPage'
/usr/bin/ld: /opt/rocm/lib/libhsa-runtime64.so.1.11.0: undefined reference to `hsaKmtWaitOnEvent_Ext'
Note that the static library in the ubuntu package does include the symbols:
nm /opt/rocm-5.7.0/lib/libhsakmt.a | grep -E "_Ext|Asan"
0000000000001200 T hsaKmtWaitOnEvent_Ext
0000000000000970 T hsaKmtWaitOnMultipleEvents_Ext
0000000000001430 T hsaKmtReplaceAsanHeaderPage
0000000000001440 T hsaKmtReturnAsanHeaderPage
Where do I find correct source code?
@fxkamd @kentrussell Can you tell me where I find the source code for the missing symbols?
It should be on the rocm-5.7.x branch. E.g. https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/5268ea80e32a6f09f911114e12a20a8176055163/src/events.c#L220 and https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/5268ea80e32a6f09f911114e12a20a8176055163/src/memory.c#L598.
The AsanHeaderPage symbols are only compiled when you build for address sanitizer enabled.
This came up internally a few weeks ago (in Fedora), and I was discussing it with one of our packaging guys. gcc is actually optimizing this out during compilation as part of a fun little quirk when you combine -flto=auto and -O . -O0 is fine, but any non-zero value of -O and -flto=auto will cause those 3 to get optimized out. And this is also only when building it as a shared object. Compiling as a static library gets around it, as well as dropping -flto=auto or dropping -O to 0 . Or enabling ASAN in your build.
Thanks for the hints. Actually, the problem has nothing to do with lto or optimization flags. The issue was that the src/libhsakmt.ver wasn't updated to include the newly added symbols. By default, the symbols would be set to local instead of global. My PR fixes this.
I'm a bit surprised that anyone is still building the Thunk as a shared library. We made it a static library for a reason: It has only a single user (ROCr, not counting KFDTest) and we want the freedom to break the ABI with ROCr. But I guess we never properly communicated and documented that.
From a packager's perspective for a major Linux distribution I always prefer shared libraries. For ROCT this has worked since the first Arch packages for ROCm 3.x. I would therefore welcome a clear documentation of this. Or remove libhsakmt.ver as it doesn't seem to be actively maintained.
The advantage of a dynamic library is that it doesn't required rebuilding dependent packages (i.e. ROCr in this case) if patches are manually cherry-picked in. It's not abnormal for distros to apply their own patches on top, or cherry-pick fixes from development branches before a release is available. CVE's are also particularly problematic with static libs.
I spoke to @kentrussell about this, but it seems we're on the same page that thunk source belongs in ROCr in the long term, as that's much clearer than the current situation. Having to call "cmake" twice on two different repos unintentionally hints to packagers that they are two distinct projects.
I agree, it would be great to see a combined repo for this, similar to clr for HIP and OpenCL runtimes.
I'll pull this internally so we can get it released officially, since it's useful for people out there still using a thunk SO. Thanks for the help! I'll leave this open until we have the patch merged and published for all to see.
Hi, this should be resolved in 6.0, but there's a new symbol in 6.0 you'll need to cherry-pick a4d46eb769313ddfdc238fdd358519f1c56bfa1f
Once the combined repo happens, we should be able to easily avoid this in the future.
Resolved this issue.