AMDMIGraphX
AMD's graph optimization engine.
To improve performance for transpose kernels, we should load the transposed inputs into LDS directly and then read from LDS instead. We have a function like `preload_copy` which will do this...
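As a rough illustration of what staging a transposed input through LDS looks like (a standalone HIP sketch with hypothetical sizes and plain `float`, not MIGraphX's `preload_copy` or its kernel codegen), a tiled transpose that reads coalesced from global memory, parks the tile in LDS, and writes it back out transposed:

```cpp
#include <hip/hip_runtime.h>
#include <cassert>
#include <cstdio>
#include <vector>

constexpr int TILE = 32;

// Stage a TILE x TILE block of the input in LDS, then write it out transposed.
// Both the global read and the global write are coalesced; the transpose itself
// happens while the data sits in LDS. Launch with blockDim = {TILE, TILE}.
__global__ void transpose_lds(const float* __restrict__ in,
                              float* __restrict__ out,
                              int rows, int cols)
{
    // +1 padding avoids LDS bank conflicts on the transposed reads
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x; // column in the input
    int y = blockIdx.y * TILE + threadIdx.y; // row in the input
    if(x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];

    __syncthreads();

    // Swap the block indices so the output write stays coalesced
    int tx = blockIdx.y * TILE + threadIdx.x; // column in the output
    int ty = blockIdx.x * TILE + threadIdx.y; // row in the output
    if(tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
}

int main()
{
    const int rows = 256, cols = 513;
    std::vector<float> h_in(rows * cols), h_out(rows * cols);
    for(int i = 0; i < rows * cols; ++i)
        h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    hipMalloc(reinterpret_cast<void**>(&d_in), h_in.size() * sizeof(float));
    hipMalloc(reinterpret_cast<void**>(&d_out), h_out.size() * sizeof(float));
    hipMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), hipMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transpose_lds<<<grid, block>>>(d_in, d_out, rows, cols);
    hipMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), hipMemcpyDeviceToHost);

    // The output is cols x rows: out[c][r] == in[r][c]
    for(int r = 0; r < rows; ++r)
        for(int c = 0; c < cols; ++c)
            assert(h_out[c * rows + r] == h_in[r * cols + c]);
    std::puts("LDS-staged transpose matches");

    hipFree(d_in);
    hipFree(d_out);
}
```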
Example series of instructions found in Longformer:
```
@145 = hip::copy(@121,@144) -> half_type, {4, 4, 256, 513}, {525312, 131328, 513, 1}: 0.0193204ms, 1%
@146 = load[offset=8404992,end=12607488](@1) -> half_type, {4, 4,...
```
Fuse average pooling with convolution:
```
@77 = gpu::code_object[code_object=9344,symbol_name=pad_kernel,global=262848,local=1024,](@57,@76) -> float_type, {1, 192, 37, 37}, {262848, 1369, 37, 1}
@78 = load[offset=705600,end=1646400](@1) -> float_type, {1, 192, 35, 35}, {235200, 1225,...
```
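One reason this fusion is natural is that an average pool is itself a convolution whose weights are all 1/(k·k), so it can be expressed with the same machinery as the neighbouring convolutions. A small CPU-side sketch of that equivalence (hypothetical sizes, plain C++, not MIGraphX code):

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

// Average pooling written two ways over a single-channel H x W input:
// directly, and as a convolution whose weights are all 1/(k*k).
int main()
{
    const int H = 6, W = 6, k = 2, stride = 2;
    std::vector<float> x(H * W);
    for(int i = 0; i < H * W; ++i)
        x[i] = static_cast<float>(i);

    const int OH = (H - k) / stride + 1, OW = (W - k) / stride + 1;
    std::vector<float> pooled(OH * OW), conved(OH * OW);

    for(int oy = 0; oy < OH; ++oy)
        for(int ox = 0; ox < OW; ++ox)
        {
            float sum = 0.0f, conv = 0.0f;
            for(int ky = 0; ky < k; ++ky)
                for(int kx = 0; kx < k; ++kx)
                {
                    float v = x[(oy * stride + ky) * W + (ox * stride + kx)];
                    sum += v;                      // pooling accumulates...
                    conv += v * (1.0f / (k * k));  // ...conv multiplies by constant weights
                }
            pooled[oy * OW + ox] = sum / (k * k);
            conved[oy * OW + ox] = conv;
        }

    for(int i = 0; i < OH * OW; ++i)
        assert(std::fabs(pooled[i] - conved[i]) < 1e-5f);
    std::puts("average pool == convolution with uniform weights");
}
```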
From the 22 Feb 2024 performance model review of Distilgpt2: this is what Paul had suggested, but it can go further because the pointwise is also used once, e.g. pointwise kernel @55 here...
From the 22 Feb 2024 performance model review of Distilgpt2: There are several gemms that are applied together (this is the tail end of attention):
```
@17 = hip::hip_copy_literal[id=main:@literal:6] -> half_type, {348,...
```
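The excerpt is truncated, but one standard way to combine several gemms is horizontal fusion: when they share the same left-hand input, their weight matrices can be concatenated and the whole group run as one wider gemm whose output is then sliced. A plain C++ sketch of that equivalence (the shared-input assumption and all sizes are hypothetical, not taken from the Distilgpt2 trace):

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

// Naive row-major matmul: C[m x n] = A[m x k] * B[k x n]
static void gemm(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C, int m, int k, int n)
{
    for(int i = 0; i < m; ++i)
        for(int j = 0; j < n; ++j)
        {
            float acc = 0.0f;
            for(int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

int main()
{
    const int m = 4, k = 8, n1 = 3, n2 = 5;
    std::vector<float> A(m * k), B1(k * n1), B2(k * n2);
    for(size_t i = 0; i < A.size(); ++i)  A[i]  = 0.01f * i;
    for(size_t i = 0; i < B1.size(); ++i) B1[i] = 0.02f * i;
    for(size_t i = 0; i < B2.size(); ++i) B2[i] = 0.03f * i;

    // Two separate gemms sharing the same left-hand input A
    std::vector<float> C1(m * n1), C2(m * n2);
    gemm(A, B1, C1, m, k, n1);
    gemm(A, B2, C2, m, k, n2);

    // Fused version: concatenate B1 and B2 along the column dimension
    const int n = n1 + n2;
    std::vector<float> B(k * n), C(m * n);
    for(int p = 0; p < k; ++p)
    {
        for(int j = 0; j < n1; ++j) B[p * n + j]      = B1[p * n1 + j];
        for(int j = 0; j < n2; ++j) B[p * n + n1 + j] = B2[p * n2 + j];
    }
    gemm(A, B, C, m, k, n);

    // The slices of the fused result match the separate gemms
    for(int i = 0; i < m; ++i)
    {
        for(int j = 0; j < n1; ++j) assert(std::fabs(C[i * n + j]      - C1[i * n1 + j]) < 1e-4f);
        for(int j = 0; j < n2; ++j) assert(std::fabs(C[i * n + n1 + j] - C2[i * n2 + j]) < 1e-4f);
    }
    std::puts("fused gemm matches the separate gemms");
}
```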
From the 22 Feb 2024 performance model review of Distilgpt2: Although it might be minor, we could fuse a pointwise with gather so we can get rid of the extra...
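The fusion itself is mechanical: rather than materializing the gather output and launching a separate pointwise kernel, the pointwise op is applied to each element as it is gathered. A minimal CPU sketch of the idea (the multiply-add pointwise op and the data are just placeholders, not the Distilgpt2 graph):

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> data    = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    std::vector<int>   indices = {4, 0, 2, 2, 1};

    // Unfused: gather into a temporary buffer, then a second pass for the pointwise op
    std::vector<float> gathered(indices.size()), unfused(indices.size());
    for(size_t i = 0; i < indices.size(); ++i)
        gathered[i] = data[indices[i]];
    for(size_t i = 0; i < gathered.size(); ++i)
        unfused[i] = gathered[i] * 2.0f + 1.0f; // example pointwise op

    // Fused: apply the pointwise op as each element is gathered; no temporary buffer
    std::vector<float> fused(indices.size());
    for(size_t i = 0; i < indices.size(); ++i)
        fused[i] = data[indices[i]] * 2.0f + 1.0f;

    for(size_t i = 0; i < fused.size(); ++i)
        assert(std::fabs(fused[i] - unfused[i]) < 1e-6f);
    std::puts("fused gather+pointwise matches the two-kernel version");
}
```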
From the 22 Feb 2024 performance model review of Distilgpt2: There is a where before the softmax which prevents us from using flash attention:
```
@34 = gpu::code_object[code_object=9224,symbol_name=where_kernel,global=363312,local=1024,](@33,@30,@32) -> half_type,...
```
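The where here is presumably the attention mask, i.e. where(mask, scores, -inf) feeding the softmax. That is mathematically the same as a masked softmax computed inside a single kernel, which is how a fused attention kernel could absorb it. A small CPU sketch of the equivalence (toy sizes, not MIGraphX code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

int main()
{
    std::vector<float> scores = {0.5f, 2.0f, -1.0f, 3.0f};
    std::vector<bool>  mask   = {true, false, true, true}; // false -> masked out

    const float neg_inf = -std::numeric_limits<float>::infinity();

    // Unfused: where(mask, scores, -inf) followed by a plain softmax
    std::vector<float> masked(scores.size()), unfused(scores.size());
    for(size_t i = 0; i < scores.size(); ++i)
        masked[i] = mask[i] ? scores[i] : neg_inf;
    float mx = neg_inf;
    for(float v : masked) mx = std::max(mx, v);
    float denom = 0.0f;
    for(float v : masked) denom += std::exp(v - mx);
    for(size_t i = 0; i < masked.size(); ++i)
        unfused[i] = std::exp(masked[i] - mx) / denom;

    // Fused: a masked softmax that never materializes the where output
    std::vector<float> fused(scores.size(), 0.0f);
    float fmx = neg_inf, fdenom = 0.0f;
    for(size_t i = 0; i < scores.size(); ++i)
        if(mask[i]) fmx = std::max(fmx, scores[i]);
    for(size_t i = 0; i < scores.size(); ++i)
        if(mask[i]) fdenom += std::exp(scores[i] - fmx);
    for(size_t i = 0; i < scores.size(); ++i)
        if(mask[i]) fused[i] = std::exp(scores[i] - fmx) / fdenom;

    for(size_t i = 0; i < scores.size(); ++i)
        assert(std::fabs(fused[i] - unfused[i]) < 1e-6f);
    std::puts("masked softmax matches where(mask, x, -inf) + softmax");
}
```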
Add additional flags to MIGraphX driver perf to allow for different timing methodologies that match how we run a model through onnxruntime. Handling things this way allows us to...
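For context, the main methodological difference is between summing per-kernel GPU times and timing whole inference calls wall-clock after a warmup phase, which is roughly the style of number onnxruntime's perf tooling reports. A generic sketch of the wall-clock style, where `run_model()` is a purely hypothetical stand-in for one inference (not a MIGraphX or onnxruntime API):

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for a single end-to-end inference call;
// not a MIGraphX API, just something to time.
static void run_model() { /* ... execute one inference ... */ }

int main()
{
    const int warmup = 10, iters = 100;

    // Warmup runs are excluded so one-time costs (allocation, compilation,
    // cold caches) do not skew the reported average.
    for(int i = 0; i < warmup; ++i)
        run_model();

    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < iters; ++i)
        run_model();
    auto stop = std::chrono::steady_clock::now();

    double total_ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    std::printf("avg end-to-end latency: %.3f ms over %d runs\n",
                total_ms / iters, iters);
}
```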
See discussions here: https://github.com/ROCm/AMDMIGraphX/pull/3299#issuecomment-2246075234