HIP icon indicating copy to clipboard operation
HIP copied to clipboard

[Issue]: [HIP] Relocatable device code + LTO + static library still broken in ROCm 7.1

Open wtjones1 opened this issue 3 weeks ago • 4 comments

Problem Description

Hi HIP maintainers,

In ROCm 7.1.0 (hipcc reports HIP 7.1.25424, Clang 20.0.0git) the combination of

  • -fgpu-rdc (relocatable device code)
  • -flto (ThinLTO)
  • a static library containing device functions in separate translation units

still fails to link, even with --whole-archive.

What works

# Direct source linking (single hipcc invocation) → works with full LTO inlining
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a device_func1.hip device_func2.hip kernel.hip -o works

What fails (separate compilation + static lib)

hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -c device_func1.hip -o device_func1.o
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -c device_func2.hip -o device_func2.o
llvm-ar cru libDeviceFuncs.a device_func1.o device_func2.o
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -c kernel.hip -o kernel.o

# Variant 1 – normal -l
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -L. -lDeviceFuncs kernel.o -o fail1
→ undefined hidden symbol: __hip_gpubin_handle_…   (and __hip_fatbin_…)

# Variant 2 – --whole-archive
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -Wl,--whole-archive libDeviceFuncs.a -Wl,--no-whole-archive kernel.o -o fail2
→ ld.lld: error: libDeviceFuncs.a(device_func1.o): not an ELF file

Full verbose logs for both failing cases are attached / shown below (they contain the offload-bundler and lld invocations).

This has been a long-standing problem (see #2203, LLVM #77018, LLVM #778018). ROCm 7.1 was expected to improve parallel LTO with -fgpu-rdc, but the static-library path is still unusable for modular device libraries with LTO.

Many large codebases (including ours) need separate compilation + static device libraries + full LTO inlining for performance. The current workaround is either (a) drop LTO completely, or (b) concatenate all device code into one file. The latter is not an option for us.

Could you please confirm whether this is intended to be fixed in an upcoming 7.1.x point release or only in 7.2? Any known workarounds beyond the ones above would also be very welcome.

Thanks!

Bill Jones
NASA Langley Research Center

Operating System

Oracle Linux Server 8.6

CPU

AMD EPYC 7313 16-Core Processor

GPU

AMD Instinct MI210 (amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-)

ROCm Version

7.1.0 and previous

ROCm Component

llvm-project

Steps to Reproduce

Minimal reproducible example (any gfx target, here gfx90a)

// device_func1.hip
__device__ int add1(int x) { return x + 1; }

// device_func2.hip
__device__ int mul2(int x) { return x * 2; }

// kernel.hip
#include <hip/hip_runtime.h>
__global__ void mykernel(int *out, int in) {
    *out = mul2(add1(in));
}
int main() {
    int *d, h = 0;
    hipMalloc(&d, sizeof(int));
    hipLaunchKernelGGL(mykernel, 1, 1, 0, 0, d, 5);
    hipMemcpyDtoH(&h, d, sizeof(int));
    printf("Result: %d\n", h);   // expected 12
    hipFree(d);
    return 0;
}

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded

HSA System Attributes

Runtime Version: 1.18 Runtime Ext Version: 1.14 System Timestamp Freq.: 1000.000000MHz Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED XNACK enabled: NO DMAbuf Support: NO VMM Support: NO

==========
HSA Agents


Agent 1


Name: AMD EPYC 7313 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7313 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3000
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: AMD EPYC 7313 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7313 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3000
BDFID: 0
Internal Node ID: 1
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 3


Name: gfx90a
Uuid: GPU-ed7d74702f8ea3a7
Marketing Name: AMD Instinct MI210
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29711(0x740f)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 50432
Internal Node ID: 2
Compute Unit: 104
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 82
SDMA engine uCode:: 8
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack- Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32


Agent 4


Name: gfx90a
Uuid: GPU-02d164651de1b058
Marketing Name: AMD Instinct MI210
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 3
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29711(0x740f)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 34048
Internal Node ID: 3
Compute Unit: 104
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 82
SDMA engine uCode:: 8
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack- Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

log_works.txt log_fail_normal.txt log_fail_whole_archive.txt

wtjones1 avatar Dec 01 '25 16:12 wtjones1

@yxsamliu any ideas on this issue? Is it expected to work?

lamb-j avatar Dec 02 '25 20:12 lamb-j

In non-LTO rdc mode, device bitcode is embedded in host objects, and the host linker is able to link these host objects in the archive.

However, in LTO rdc mode, device bitcode and host bitcode are bundled as offload bundles, then the host linker does not know how to handle the bundles in the archive.

It seems the new offload driver is able to handle this use case, at least in llvm trunk. Try adding --offload-new-driver option for both compilation and linking.

yxsamliu avatar Dec 04 '25 18:12 yxsamliu

Just tested and confirmed that ROCm 7.1 works with --offload-new-driver. Compile with

hipcc -O2 -fgpu-rdc -flto --offload-new-driver -c device_func1.hip -o device_func1.o
hipcc -O2 -fgpu-rdc -flto --offload-new-driver -c device_func2.hip -o device_func2.o
llvm-ar cru libDeviceFuncs.a device_func1.o device_func2.o
hipcc -O2 -fgpu-rdc -flto --offload-new-driver -c kernel.hip -o kernel.o

Link successfully with --whole-archive

hipcc -O2 -fgpu-rdc -flto --offload-new-driver -Wl,--whole-archive libDeviceFuncs.a -Wl,--no-whole-archive kernel.o -o fail2 

and runs as expected

$ ./fail2 
Result: 12

zichguan-amd avatar Dec 04 '25 20:12 zichguan-amd

Hi @wtjones1, can you give it a try and let me know if it works for your workload?

zichguan-amd avatar Dec 05 '25 20:12 zichguan-amd