ROCT-Thunk-Interface icon indicating copy to clipboard operation
ROCT-Thunk-Interface copied to clipboard

Missing symbols in 5.7.0 release

Open tpkessler opened this issue 2 years ago • 10 comments

Hi! I'm packaging ROCm for Arch Linux. The new release of hsa-rocr now calls functions like hsaKmtWaitOnEvent_Ext or hsaKmtReplaceAsanHeaderPage. However, these functions are not part of version 5.7.0: I couldn't find matching declarations or definitions in the source code.

Example output when building rocminfo:

/usr/bin/ld: /opt/rocm/lib/libhsa-runtime64.so.1.11.0: undefined reference to `hsaKmtWaitOnMultipleEvents_Ext'
/usr/bin/ld: /opt/rocm/lib/libhsa-runtime64.so.1.11.0: undefined reference to `hsaKmtReplaceAsanHeaderPage'
/usr/bin/ld: /opt/rocm/lib/libhsa-runtime64.so.1.11.0: undefined reference to `hsaKmtWaitOnEvent_Ext'

Note that the static library in the ubuntu package does include the symbols:

nm /opt/rocm-5.7.0/lib/libhsakmt.a | grep -E "_Ext|Asan"
0000000000001200 T hsaKmtWaitOnEvent_Ext
0000000000000970 T hsaKmtWaitOnMultipleEvents_Ext
0000000000001430 T hsaKmtReplaceAsanHeaderPage
0000000000001440 T hsaKmtReturnAsanHeaderPage

Where do I find correct source code?

tpkessler avatar Sep 17 '23 14:09 tpkessler

@fxkamd @kentrussell Can you tell me where I find the source code for the missing symbols?

tpkessler avatar Sep 29 '23 12:09 tpkessler

It should be on the rocm-5.7.x branch. E.g. https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/5268ea80e32a6f09f911114e12a20a8176055163/src/events.c#L220 and https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/5268ea80e32a6f09f911114e12a20a8176055163/src/memory.c#L598.

The AsanHeaderPage symbols are only compiled when you build for address sanitizer enabled.

fxkamd avatar Sep 29 '23 14:09 fxkamd

This came up internally a few weeks ago (in Fedora), and I was discussing it with one of our packaging guys. gcc is actually optimizing this out during compilation as part of a fun little quirk when you combine -flto=auto and -O . -O0 is fine, but any non-zero value of -O and -flto=auto will cause those 3 to get optimized out. And this is also only when building it as a shared object. Compiling as a static library gets around it, as well as dropping -flto=auto or dropping -O to 0 . Or enabling ASAN in your build.

kentrussell avatar Sep 29 '23 14:09 kentrussell

Thanks for the hints. Actually, the problem has nothing to do with lto or optimization flags. The issue was that the src/libhsakmt.ver wasn't updated to include the newly added symbols. By default, the symbols would be set to local instead of global. My PR fixes this.

tpkessler avatar Sep 30 '23 17:09 tpkessler

I'm a bit surprised that anyone is still building the Thunk as a shared library. We made it a static library for a reason: It has only a single user (ROCr, not counting KFDTest) and we want the freedom to break the ABI with ROCr. But I guess we never properly communicated and documented that.

fxkamd avatar Oct 02 '23 14:10 fxkamd

From a packager's perspective for a major Linux distribution I always prefer shared libraries. For ROCT this has worked since the first Arch packages for ROCm 3.x. I would therefore welcome a clear documentation of this. Or remove libhsakmt.ver as it doesn't seem to be actively maintained.

tpkessler avatar Oct 03 '23 16:10 tpkessler

The advantage of a dynamic library is that it doesn't required rebuilding dependent packages (i.e. ROCr in this case) if patches are manually cherry-picked in. It's not abnormal for distros to apply their own patches on top, or cherry-pick fixes from development branches before a release is available. CVE's are also particularly problematic with static libs.

I spoke to @kentrussell about this, but it seems we're on the same page that thunk source belongs in ROCr in the long term, as that's much clearer than the current situation. Having to call "cmake" twice on two different repos unintentionally hints to packagers that they are two distinct projects.

Mystro256 avatar Oct 03 '23 16:10 Mystro256

I agree, it would be great to see a combined repo for this, similar to clr for HIP and OpenCL runtimes.

tpkessler avatar Oct 03 '23 17:10 tpkessler

I'll pull this internally so we can get it released officially, since it's useful for people out there still using a thunk SO. Thanks for the help! I'll leave this open until we have the patch merged and published for all to see.

kentrussell avatar Oct 03 '23 17:10 kentrussell

Hi, this should be resolved in 6.0, but there's a new symbol in 6.0 you'll need to cherry-pick a4d46eb769313ddfdc238fdd358519f1c56bfa1f

Once the combined repo happens, we should be able to easily avoid this in the future.

Mystro256 avatar Jan 04 '24 17:01 Mystro256

Resolved this issue.

kentrussell avatar Aug 12 '24 14:08 kentrussell