hsa_executable_get_symbol_by_name is broken for V3 code metadata?

Open jpsamaroo opened this issue 4 years ago • 5 comments

I am the maintainer of Julia's AMDGPU computing stack, and was recently upgrading our ROCR-Runtime wrapper package HSARuntime.jl to support builds of Julia that use LLVM >= 7. Those versions of LLVM switch the default code metadata format from V2 to V3, and the two formats take separate code paths within ROCR. During this upgrade I ran into an issue where hsa_executable_get_symbol_by_name fails to query a symbol in an executable using V3 metadata; the specific error is HSA_STATUS_ERROR_INVALID_SYMBOL_NAME. Examining my emitted binary with readelf, I saw the two symbols that I should expect to see according to the LLVM AMDGPU User Guide, symbol_name and symbol_name.kd, with the correct ELF type and section. However, passing either symbol_name or symbol_name.kd to the above function failed.

I was eventually able to work around this issue, after seeing a similar approach used by HIP, by iterating over all agent symbols with hsa_executable_iterate_agent_symbols and selecting the first kernel symbol found (which is currently fine, since our stack's compiler only emits one agent kernel per binary). While I'm okay with leaving this workaround in place for now, I'd like to ask whether this behavior is intended, or whether it is a bug in ROCR that isn't tested for.

Thanks!

jpsamaroo avatar Apr 19 '20 18:04 jpsamaroo

This seems like a bug. The symbol to use to denote a kernel in an AQL dispatch packet can be obtained from the V3 metadata, and it should be possible to query that symbol to get its address. It seems you have found that this is broken and so needs fixing. Thanks for reporting.

t-tye avatar Apr 19 '20 18:04 t-tye

tmp.zip A simple code object file with v3 metadata is attached.

yxsamliu avatar Apr 20 '20 20:04 yxsamliu

I am running into the same issue.

Using agent: gfx1030
hsa_executable_get_symbol failed: HSA_STATUS_ERROR_INVALID_SYMBOL_NAME: There is no symbol with the given name.
Failed

My kernel is the example from https://llvm.org/docs/AMDGPUUsage.html#code-object-v3-to-v4-example-source-code

I am on AOMP 13.x (latest as of today) and latest rocr.

powderluv avatar Feb 21 '21 09:02 powderluv

Hello, I took a quick look at the code object attached. This is the intended behaviour.

The provided code object has the following dynsym:

Symbol table '.dynsym' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000002400   868 FUNC    GLOBAL PROTECTED    7 _Z16kernel_dot_part1ILj256EEviPKdS1_Pd
     2: 0000000000001140    64 OBJECT  GLOBAL PROTECTED    6 _Z16kernel_dot_part1ILj256EEviPKdS1_Pd.kd
     3: 0000000000002000   836 FUNC    GLOBAL PROTECTED    7 _Z16kernel_dot_part2ILj256EEviPKdPd
     4: 0000000000001100    64 OBJECT  GLOBAL PROTECTED    6 _Z16kernel_dot_part2ILj256EEviPKdPd.kd
     5: 0000000000002800   548 FUNC    GLOBAL PROTECTED    7 _Z16kernel_dot_part3ILj256EEvPd
     6: 0000000000001180    64 OBJECT  GLOBAL PROTECTED    6 _Z16kernel_dot_part3ILj256EEvPd.kd

I can query all symbols ending in ".kd" successfully using hsa_executable_get_symbol_by_name (kernel descriptor, goes into dispatch packet, etc.).

I get an HSA_STATUS_ERROR_INVALID_SYMBOL_NAME error when I query the kernel entry point symbols, which is expected: querying kernel entry point symbols is not supported.

The language runtime should inspect the metadata and find the appropriate kernel descriptor symbol:

https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3

kzhuravl avatar Mar 10 '21 21:03 kzhuravl

@kzhuravl what version of ROCm are you using to test this? I remember when testing this that I wasn't able to query the .kd symbols by name either; if that had worked, I wouldn't have filed this issue.

Regardless of that, this feels like an overly breaking change. It would have been nice if the runtime checked whether the specified string ends in .kd and, if it doesn't and the loaded executable is using V3 metadata, printed a warning that we shouldn't be using that name, and then automatically appended .kd and used that symbol. It's extra work for the runtime, but it would also have been a nice deprecation mechanism.
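For anyone who wants this fallback on the caller side in the meantime, here is a minimal sketch in plain C. The helper names has_kd_suffix and kernel_descriptor_name are hypothetical (they are not part of ROCR), and the sketch assumes the descriptor symbol is always the kernel name plus ".kd", which matches the .dynsym listing above but is not something the runtime guarantees; the `.symbol` entry in the kernel metadata remains the authoritative source.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` already ends in ".kd", else 0. */
static int has_kd_suffix(const char *name) {
    static const char suffix[] = ".kd";
    size_t name_len = strlen(name);
    size_t suffix_len = sizeof(suffix) - 1;
    return name_len >= suffix_len &&
           strcmp(name + name_len - suffix_len, suffix) == 0;
}

/*
 * Hypothetical helper: returns a newly allocated copy of `name`, with ".kd"
 * appended when it is missing. The caller owns the result and must free() it.
 * Returns NULL if the allocation fails.
 *
 * Assumption: the kernel descriptor symbol is spelled "<kernel name>.kd".
 */
static char *kernel_descriptor_name(const char *name) {
    size_t name_len = strlen(name);
    size_t extra = has_kd_suffix(name) ? 0 : 3; /* strlen(".kd") */
    char *out = malloc(name_len + extra + 1);
    if (out == NULL) {
        return NULL;
    }
    memcpy(out, name, name_len);
    if (extra != 0) {
        memcpy(out + name_len, ".kd", extra);
    }
    out[name_len + extra] = '\0';
    return out;
}
```

The string returned by kernel_descriptor_name would then be the name passed to hsa_executable_get_symbol_by_name, so that both "_Z16kernel_dot_part3ILj256EEvPd" and "_Z16kernel_dot_part3ILj256EEvPd.kd" resolve to the same descriptor query.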

Also, if you have any suggestions on how to access just the HSA metadata from either the LLVM C API (not C++) or with LLVM command line tools, I'd appreciate it greatly :slightly_smiling_face:

jpsamaroo avatar Mar 11 '21 13:03 jpsamaroo