[ROCM] Evaluate whether we can attach `amdgpu-no-implicitarg-ptr` to our generated functions.
If any implicit argument is used, LLVM reserves 256 bytes of kernarg space and emits metadata requiring the runtime to populate all implicit arguments. The only ways to control this are to avoid using any implicit arguments or to force it with the `amdgpu-no-implicitarg-ptr` function attribute.
See https://github.com/llvm/llvm-project/blob/7f1b465c6ae476e59dc90652d58fc648932d23b1/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp#L299 for where the 256 bytes are specified, and https://github.com/llvm/llvm-project/blob/7f1b465c6ae476e59dc90652d58fc648932d23b1/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp#L389 for where all the metadata is emitted. The runtime then slowly walks this metadata on every dispatch and emits all of the arguments: https://github.com/ROCm/clr/blob/5da72f9d524420c43fe3eee44b11ac875d884e0f/rocclr/device/rocm/rocvirtual.cpp#L3197
If we don't need the implicit args, we can reduce three overheads: kernarg space (dozens of bytes per dispatch instead of the hundreds we use today), the launch cost of walking the metadata and writing the kernargs, and the registers that kernarg preloading may be wasting during execution.
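As a rough illustration of the kernarg space difference, here is a minimal sketch of the size arithmetic. The 256-byte figure is the one quoted in this issue; the 8-byte alignment applied before the implicit block and the example argument list are assumptions for illustration only, and the linked `AMDGPUSubtarget.cpp` is the authority on the real layout.

```python
IMPLICIT_ARG_BYTES = 256  # figure quoted in this issue
IMPLICIT_ARG_ALIGN = 8    # assumption: alignment applied before the implicit block


def align_to(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def kernarg_segment_size(explicit_bytes, uses_implicit_args):
    """Estimate kernarg bytes per dispatch for a kernel (illustrative only)."""
    if not uses_implicit_args:
        return explicit_bytes
    return align_to(explicit_bytes, IMPLICIT_ARG_ALIGN) + IMPLICIT_ARG_BYTES


# Hypothetical dispatch: three 64-bit buffer pointers and four 32-bit constants.
explicit = 3 * 8 + 4 * 4
print(kernarg_segment_size(explicit, uses_implicit_args=True))
print(kernarg_segment_size(explicit, uses_implicit_args=False))
```

Under these assumptions, the implicit block dominates the per-dispatch kernarg footprint for a kernel with only a handful of explicit arguments.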
We should evaluate whether we ever produce code that uses implicit args and, if so, whether we can avoid it. We may be able to force `amdgpu-no-implicitarg-ptr`, see whether we get errors when something does try to use them, and work back from there.
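One possible starting point for that evaluation is a textual scan of the LLVM IR we emit for direct calls to the `llvm.amdgcn.implicitarg.ptr` intrinsic. The sketch below is only a heuristic: the helper name and the sample IR are hypothetical, it does not handle quoted function names, and it would miss implicit args that are only reached indirectly (for example through code linked in later), so it is best run on the final linked module.

```python
import re

IMPLICIT_ARG_INTRINSIC = "llvm.amdgcn.implicitarg.ptr"

# Matches the name in a line such as `define amdgpu_kernel void @name(...)`.
DEFINE_RE = re.compile(r"^define\b.*?@([\w.$-]+)\s*\(")


def functions_using_implicit_args(ir_text):
    """Return names of defined functions whose bodies mention the intrinsic."""
    users = []
    current = None
    for line in ir_text.splitlines():
        match = DEFINE_RE.match(line)
        if match:
            current = match.group(1)
            continue
        if line.startswith("}"):
            current = None
            continue
        if current and IMPLICIT_ARG_INTRINSIC in line and current not in users:
            users.append(current)
    return users


# Hypothetical IR for illustration; not actual IREE output.
SAMPLE_IR = """\
declare ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()

define amdgpu_kernel void @uses_implicit(ptr addrspace(1) %out) {
  %p = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
  ret void
}

define amdgpu_kernel void @clean(ptr addrspace(1) %out) #0 {
  ret void
}

attributes #0 = { "amdgpu-no-implicitarg-ptr" }
"""

print(functions_using_implicit_args(SAMPLE_IR))
```

Any function this reports is a candidate to investigate before forcing the attribute on all generated functions.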