[Issue]: [HIP] Relocatable device code + LTO + static library still broken in ROCm 7.1
Problem Description
Hi HIP maintainers,
In ROCm 7.1.0 (hipcc reports HIP 7.1.25424, Clang 20.0.0git) the combination of
-fgpu-rdc(relocatable device code)-flto(ThinLTO)- a static library containing device functions in separate translation units
still fails to link, even with --whole-archive.
What works
# Direct source linking (single hipcc invocation) → works with full LTO inlining
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a device_func1.hip device_func2.hip kernel.hip -o works
What fails (separate compilation + static lib)
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -c device_func1.hip -o device_func1.o
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -c device_func2.hip -o device_func2.o
llvm-ar cru libDeviceFuncs.a device_func1.o device_func2.o
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -c kernel.hip -o kernel.o
# Variant 1 – normal -l
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -L. -lDeviceFuncs kernel.o -o fail1
→ undefined hidden symbol: __hip_gpubin_handle_… (and __hip_fatbin_…)
# Variant 2 – --whole-archive
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -Wl,--whole-archive libDeviceFuncs.a -Wl,--no-whole-archive kernel.o -o fail2
→ ld.lld: error: libDeviceFuncs.a(device_func1.o): not an ELF file
Full verbose logs for both failing cases are attached / shown below (they contain the offload-bundler and lld invocations).
This has been a long-standing problem (see #2203, LLVM #77018, LLVM #778018). ROCm 7.1 was expected to improve parallel LTO with -fgpu-rdc, but the static-library path is still unusable for modular device libraries with LTO.
Many large codebases (including ours) need separate compilation + static device libraries + full LTO inlining for performance. The current workaround is either (a) drop LTO completely, or (b) concatenate all device code into one file. The latter is not an option for us.
Could you please confirm whether this is intended to be fixed in an upcoming 7.1.x point release or only in 7.2? Any known workarounds beyond the ones above would also be very welcome.
Thanks!
Bill Jones
NASA Langley Research Center
Operating System
Oracle Linux Server 8.6
CPU
AMD EPYC 7313 16-Core Processor
GPU
AMD Instinct MI210 (amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-)
ROCm Version
7.1.0 and previous
ROCm Component
llvm-project
Steps to Reproduce
Minimal reproducible example (any gfx target, here gfx90a)
// device_func1.hip
__device__ int add1(int x) { return x + 1; }
// device_func2.hip
__device__ int mul2(int x) { return x * 2; }
// kernel.hip
#include <hip/hip_runtime.h>
__global__ void mykernel(int *out, int in) {
*out = mul2(add1(in));
}
int main() {
int *d, h = 0;
hipMalloc(&d, sizeof(int));
hipLaunchKernelGGL(mykernel, 1, 1, 0, 0, d, 5);
hipMemcpyDtoH(&h, d, sizeof(int));
printf("Result: %d\n", h); // expected 12
hipFree(d);
return 0;
}
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module is loaded
HSA System Attributes
Runtime Version: 1.18
Runtime Ext Version: 1.14
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: NO
VMM Support: NO
==========
HSA Agents
Agent 1
Name: AMD EPYC 7313 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7313 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3000
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: AMD EPYC 7313 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7313 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3000
BDFID: 0
Internal Node ID: 1
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 3
Name: gfx90a
Uuid: GPU-ed7d74702f8ea3a7
Marketing Name: AMD Instinct MI210
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29711(0x740f)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 50432
Internal Node ID: 2
Compute Unit: 104
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 82
SDMA engine uCode:: 8
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
Agent 4
Name: gfx90a
Uuid: GPU-02d164651de1b058
Marketing Name: AMD Instinct MI210
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 3
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29711(0x740f)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 34048
Internal Node ID: 3
Compute Unit: 104
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 82
SDMA engine uCode:: 8
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
log_works.txt log_fail_normal.txt log_fail_whole_archive.txt
@yxsamliu any ideas on this issue? Is it expected to work?
In non-LTO rdc mode, device bitcode is embedded in host objects, and the host linker is able to link these host objects in the archive.
However, in LTO rdc mode, device bitcode and host bitcode are bundled as offload bundles, then the host linker does not know how to handle the bundles in the archive.
It seems the new offload driver is able to handle this use case, at least in llvm trunk. Try adding --offload-new-driver option for both compilation and linking.
Just tested and confirmed that ROCm 7.1 works with --offload-new-driver.
Compile with
hipcc -O2 -fgpu-rdc -flto --offload-new-driver -c device_func1.hip -o device_func1.o
hipcc -O2 -fgpu-rdc -flto --offload-new-driver -c device_func2.hip -o device_func2.o
llvm-ar cru libDeviceFuncs.a device_func1.o device_func2.o
hipcc -O2 -fgpu-rdc -flto --offload-new-driver -c kernel.hip -o kernel.o
Link successfully with --whole-archive
hipcc -O2 -fgpu-rdc -flto --offload-new-driver -Wl,--whole-archive libDeviceFuncs.a -Wl,--no-whole-archive kernel.o -o fail2
and runs as expected
$ ./fail2
Result: 12
Hi @wtjones1, can you give it a try and let me know if it works for your workload?