AMDMIGraphX
AMD's graph optimization engine.
To improve performance for transpose kernels, we should load the transposed inputs into LDS directly and then read from LDS instead. We have a function like `preload_copy` which will do this...
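As a rough illustration of what staging a transposed input through LDS looks like (a standalone HIP sketch with hypothetical sizes and plain `float`, not MIGraphX's `preload_copy` or its kernel codegen), a tiled transpose that reads coalesced from global memory, parks the tile in LDS, and writes it back out transposed:

```cpp
#include <hip/hip_runtime.h>
#include <cassert>
#include <cstdio>
#include <vector>

constexpr int TILE = 32;

// Stage a TILE x TILE block of the input in LDS, then write it out transposed.
// Both the global read and the global write are coalesced; the transpose itself
// happens while the data sits in LDS. Launch with blockDim = {TILE, TILE}.
__global__ void transpose_lds(const float* __restrict__ in,
                              float* __restrict__ out,
                              int rows, int cols)
{
    // +1 padding avoids LDS bank conflicts on the transposed reads
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x; // column in the input
    int y = blockIdx.y * TILE + threadIdx.y; // row in the input
    if(x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];

    __syncthreads();

    // Swap the block indices so the output write stays coalesced
    int tx = blockIdx.y * TILE + threadIdx.x; // column in the output
    int ty = blockIdx.x * TILE + threadIdx.y; // row in the output
    if(tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
}

int main()
{
    const int rows = 256, cols = 513;
    std::vector<float> h_in(rows * cols), h_out(rows * cols);
    for(int i = 0; i < rows * cols; ++i)
        h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    hipMalloc(reinterpret_cast<void**>(&d_in), h_in.size() * sizeof(float));
    hipMalloc(reinterpret_cast<void**>(&d_out), h_out.size() * sizeof(float));
    hipMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), hipMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transpose_lds<<<grid, block>>>(d_in, d_out, rows, cols);
    hipMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), hipMemcpyDeviceToHost);

    // The output is cols x rows: out[c][r] == in[r][c]
    for(int r = 0; r < rows; ++r)
        for(int c = 0; c < cols; ++c)
            assert(h_out[c * rows + r] == h_in[r * cols + c]);
    std::puts("LDS-staged transpose matches");

    hipFree(d_in);
    hipFree(d_out);
}
```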
Example series of instructions found in Longformer:
```
@145 = hip::copy(@121,@144) -> half_type, {4, 4, 256, 513}, {525312, 131328, 513, 1}: 0.0193204ms, 1%
@146 = load[offset=8404992,end=12607488](@1) -> half_type, {4, 4,...
```
Fuse average pooling with convolution:
```
@77 = gpu::code_object[code_object=9344,symbol_name=pad_kernel,global=262848,local=1024,](@57,@76) -> float_type, {1, 192, 37, 37}, {262848, 1369, 37, 1}
@78 = load[offset=705600,end=1646400](@1) -> float_type, {1, 192, 35, 35}, {235200, 1225,...
```
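One reason this fusion is natural is that an average pool is itself a convolution whose weights are all 1/(k·k), so it can be expressed with the same machinery as the neighbouring convolutions. A small CPU-side sketch of that equivalence (hypothetical sizes, plain C++, not MIGraphX code):

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

// Average pooling written two ways over a single-channel H x W input:
// directly, and as a convolution whose weights are all 1/(k*k).
int main()
{
    const int H = 6, W = 6, k = 2, stride = 2;
    std::vector<float> x(H * W);
    for(int i = 0; i < H * W; ++i)
        x[i] = static_cast<float>(i);

    const int OH = (H - k) / stride + 1, OW = (W - k) / stride + 1;
    std::vector<float> pooled(OH * OW), conved(OH * OW);

    for(int oy = 0; oy < OH; ++oy)
        for(int ox = 0; ox < OW; ++ox)
        {
            float sum = 0.0f, conv = 0.0f;
            for(int ky = 0; ky < k; ++ky)
                for(int kx = 0; kx < k; ++kx)
                {
                    float v = x[(oy * stride + ky) * W + (ox * stride + kx)];
                    sum += v;                      // pooling accumulates...
                    conv += v * (1.0f / (k * k));  // ...conv multiplies by constant weights
                }
            pooled[oy * OW + ox] = sum / (k * k);
            conved[oy * OW + ox] = conv;
        }

    for(int i = 0; i < OH * OW; ++i)
        assert(std::fabs(pooled[i] - conved[i]) < 1e-5f);
    std::puts("average pool == convolution with uniform weights");
}
```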
From the 22 Feb 2024 performance model review of Distilgpt2: this is what Paul had suggested, but it can go further because the pointwise is also used once, e.g. pointwise kernel @55 here...
From the 22 Feb 2024 performance model review of Distilgpt2: There are several gemms that are applied together (this is the tail end of attention):
```
@17 = hip::hip_copy_literal[id=main:@literal:6] -> half_type, {348,...
```
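The excerpt is truncated, but one standard way to combine several gemms is horizontal fusion: when they share the same left-hand input, their weight matrices can be concatenated and the whole group run as one wider gemm whose output is then sliced. A plain C++ sketch of that equivalence (the shared-input assumption and all sizes are hypothetical, not taken from the Distilgpt2 trace):

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

// Naive row-major matmul: C[m x n] = A[m x k] * B[k x n]
static void gemm(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C, int m, int k, int n)
{
    for(int i = 0; i < m; ++i)
        for(int j = 0; j < n; ++j)
        {
            float acc = 0.0f;
            for(int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

int main()
{
    const int m = 4, k = 8, n1 = 3, n2 = 5;
    std::vector<float> A(m * k), B1(k * n1), B2(k * n2);
    for(size_t i = 0; i < A.size(); ++i)  A[i]  = 0.01f * i;
    for(size_t i = 0; i < B1.size(); ++i) B1[i] = 0.02f * i;
    for(size_t i = 0; i < B2.size(); ++i) B2[i] = 0.03f * i;

    // Two separate gemms sharing the same left-hand input A
    std::vector<float> C1(m * n1), C2(m * n2);
    gemm(A, B1, C1, m, k, n1);
    gemm(A, B2, C2, m, k, n2);

    // Fused version: concatenate B1 and B2 along the column dimension
    const int n = n1 + n2;
    std::vector<float> B(k * n), C(m * n);
    for(int p = 0; p < k; ++p)
    {
        for(int j = 0; j < n1; ++j) B[p * n + j]      = B1[p * n1 + j];
        for(int j = 0; j < n2; ++j) B[p * n + n1 + j] = B2[p * n2 + j];
    }
    gemm(A, B, C, m, k, n);

    // The slices of the fused result match the separate gemms
    for(int i = 0; i < m; ++i)
    {
        for(int j = 0; j < n1; ++j) assert(std::fabs(C[i * n + j]      - C1[i * n1 + j]) < 1e-4f);
        for(int j = 0; j < n2; ++j) assert(std::fabs(C[i * n + n1 + j] - C2[i * n2 + j]) < 1e-4f);
    }
    std::puts("fused gemm matches the separate gemms");
}
```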
From the 22 Feb 2024 performance model review of Distilgpt2: Although it might be minor, we could fuse a pointwise with gather so we can get rid of the extra...
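The fusion itself is mechanical: rather than materializing the gather output and launching a separate pointwise kernel, the pointwise op is applied to each element as it is gathered. A minimal CPU sketch of the idea (the multiply-add pointwise op and the data are just placeholders, not the Distilgpt2 graph):

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> data    = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    std::vector<int>   indices = {4, 0, 2, 2, 1};

    // Unfused: gather into a temporary buffer, then a second pass for the pointwise op
    std::vector<float> gathered(indices.size()), unfused(indices.size());
    for(size_t i = 0; i < indices.size(); ++i)
        gathered[i] = data[indices[i]];
    for(size_t i = 0; i < gathered.size(); ++i)
        unfused[i] = gathered[i] * 2.0f + 1.0f; // example pointwise op

    // Fused: apply the pointwise op as each element is gathered; no temporary buffer
    std::vector<float> fused(indices.size());
    for(size_t i = 0; i < indices.size(); ++i)
        fused[i] = data[indices[i]] * 2.0f + 1.0f;

    for(size_t i = 0; i < fused.size(); ++i)
        assert(std::fabs(fused[i] - unfused[i]) < 1e-6f);
    std::puts("fused gather+pointwise matches the two-kernel version");
}
```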
From the 22 Feb 2024 performance model review of Distilgpt2: There is a where before the softmax which prevents us from using flash attention:
```
@34 = gpu::code_object[code_object=9224,symbol_name=where_kernel,global=363312,local=1024,](@33,@30,@32) -> half_type,...
```
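The where here is presumably the attention mask, i.e. where(mask, scores, -inf) feeding the softmax. That is mathematically the same as a masked softmax computed inside a single kernel, which is how a fused attention kernel could absorb it. A small CPU sketch of the equivalence (toy sizes, not MIGraphX code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

int main()
{
    std::vector<float> scores = {0.5f, 2.0f, -1.0f, 3.0f};
    std::vector<bool>  mask   = {true, false, true, true}; // false -> masked out

    const float neg_inf = -std::numeric_limits<float>::infinity();

    // Unfused: where(mask, scores, -inf) followed by a plain softmax
    std::vector<float> masked(scores.size()), unfused(scores.size());
    for(size_t i = 0; i < scores.size(); ++i)
        masked[i] = mask[i] ? scores[i] : neg_inf;
    float mx = neg_inf;
    for(float v : masked) mx = std::max(mx, v);
    float denom = 0.0f;
    for(float v : masked) denom += std::exp(v - mx);
    for(size_t i = 0; i < masked.size(); ++i)
        unfused[i] = std::exp(masked[i] - mx) / denom;

    // Fused: a masked softmax that never materializes the where output
    std::vector<float> fused(scores.size(), 0.0f);
    float fmx = neg_inf, fdenom = 0.0f;
    for(size_t i = 0; i < scores.size(); ++i)
        if(mask[i]) fmx = std::max(fmx, scores[i]);
    for(size_t i = 0; i < scores.size(); ++i)
        if(mask[i]) fdenom += std::exp(scores[i] - fmx);
    for(size_t i = 0; i < scores.size(); ++i)
        if(mask[i]) fused[i] = std::exp(scores[i] - fmx) / fdenom;

    for(size_t i = 0; i < scores.size(); ++i)
        assert(std::fabs(fused[i] - unfused[i]) < 1e-6f);
    std::puts("masked softmax matches where(mask, x, -inf) + softmax");
}
```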
Add additional flags to MIGraphX driver perf to allow for different timing methodologies that match how we run a model through onnxruntime. Handling things this way allows us to...
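For context, the main methodological difference is between summing per-kernel GPU times and timing whole inference calls wall-clock after a warmup phase, which is roughly the style of number onnxruntime's perf tooling reports. A generic sketch of the wall-clock style, where `run_model()` is a purely hypothetical stand-in for one inference (not a MIGraphX or onnxruntime API):

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for a single end-to-end inference call;
// not a MIGraphX API, just something to time.
static void run_model() { /* ... execute one inference ... */ }

int main()
{
    const int warmup = 10, iters = 100;

    // Warmup runs are excluded so one-time costs (allocation, compilation,
    // cold caches) do not skew the reported average.
    for(int i = 0; i < warmup; ++i)
        run_model();

    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < iters; ++i)
        run_model();
    auto stop = std::chrono::steady_clock::now();

    double total_ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    std::printf("avg end-to-end latency: %.3f ms over %d runs\n",
                total_ms / iters, iters);
}
```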
See discussions here: https://github.com/ROCm/AMDMIGraphX/pull/3299#issuecomment-2246075234