Charlie Lin
* Implements the bitwise_and operator and its ONNX parser. * Needed for bitwise_and support in TorchMIGraphX models
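A minimal NumPy sketch of the elementwise semantics the operator needs to match (the ONNX BitwiseAnd op: integer inputs, NumPy-style broadcasting); `bitwise_and_ref` is a hypothetical reference function, not MIGraphX code:

```python
import numpy as np

def bitwise_and_ref(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Elementwise AND with multidirectional broadcasting,
    # as the ONNX BitwiseAnd op (opset 18) specifies.
    return np.bitwise_and(a, b)

a = np.array([[0b1100, 0b1010]], dtype=np.int32)   # shape (1, 2)
b = np.array([[0b1010], [0b0110]], dtype=np.int32)  # shape (2, 1), broadcasts to (2, 2)
out = bitwise_and_ref(a, b)
```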
From the 22 Feb 2024 performance model review of Distilgpt2: this is what Paul had suggested, but it can go further because the pointwise is also used only once, e.g. the pointwise kernel @55 here...
From the 22 Feb 2024 performance model review of Distilgpt2: There are several gemms that are applied together (this is the tail end of attention): ``` @17 = hip::hip_copy_literal[id=main:@literal:6] -> half_type, {348,...
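A sketch of the rewrite this suggests, assuming the gemms share the same input (as Q/K/V projections at the end of attention do): several gemms on one input are equivalent to a single wider gemm over concatenated weights, followed by a split. The shapes here are illustrative, not from the trace:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w_q, w_k, w_v = (rng.standard_normal((8, 8)).astype(np.float32) for _ in range(3))

# Three separate gemms on the same input...
q, k, v = x @ w_q, x @ w_k, x @ w_v

# ...equal one fused gemm over the horizontally concatenated weights,
# followed by a cheap split of the output columns.
fused = x @ np.concatenate([w_q, w_k, w_v], axis=1)
q2, k2, v2 = np.split(fused, 3, axis=1)
```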
From the 22 Feb 2024 performance model review of Distilgpt2: Although it might be minor, we could fuse a pointwise with gather so we can get rid of the extra...
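The fusion is legal because an elementwise (pointwise) op commutes with gather: applying it after the gather touches exactly the gathered elements, so it can run inside the gather kernel instead of as a separate launch. A small sketch, using relu as a stand-in pointwise op:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.5, 3.0], dtype=np.float32)
idx = np.array([3, 0, 2])

# gather followed by a pointwise op in a separate kernel...
separate = np.maximum(x[idx], 0.0)   # relu(gather(x, idx))

# ...equals computing the pointwise op on the gathered elements,
# i.e. it can be folded into the gather kernel itself.
fused = np.maximum(x, 0.0)[idx]      # gather(relu(x), idx)
```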
From the 22 Feb 2024 performance model review of Distilgpt2: There is a where before the softmax which prevents us from using flash attention: ``` @34 = gpu::code_object[code_object=9224,symbol_name=where_kernel,global=363312,local=1024,](@33,@30,@32) -> half_type,...
Comment out the qlinear_reused matcher because of an accuracy error for quantized resnet50
* There's an accuracy error resulting from the `qlinear_reused` matcher in `simplify_qdq`. * Note that the other half of the quantized resnet50 accuracy issue was from a disconnect between...
* During the migraphx graph optimizations introduction presentation I showed a situation where we could have used the distributive property of matrix multiplication to produce a more optimized graph than...
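The distributive rewrite in question, sketched with NumPy: two gemms sharing a left operand plus an add can be replaced by one add plus one gemm, since A·B + A·C = A·(B + C), trading a gemm for a much cheaper elementwise add:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((3, 4))
b = rng.standard_normal((4, 5))
c = rng.standard_normal((4, 5))

# Two gemms and an add...
two_gemms = a @ b + a @ c

# ...rewritten via distributivity into one add and one gemm.
one_gemm = a @ (b + c)
```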
* With our changes to softmax, we no longer use the log_softmax instruction that does the log and the softmax in one step. * We need to make a matcher...
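Why the fused instruction matters, sketched under the usual numerical-stability argument: computing `log(softmax(x))` as two steps overflows `exp` for large inputs, while the fused log-softmax form `x - max - log(sum(exp(x - max)))` stays finite, so a matcher rewriting log-of-softmax back into one instruction preserves both speed and accuracy:

```python
import numpy as np

def log_softmax(x):
    # Fused, numerically stable form: x - max - log(sum(exp(x - max))).
    m = x.max(axis=-1, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=-1, keepdims=True))

x = np.array([[1.0, 2.0, 3.0]])
naive = np.log(np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True))

# For large inputs the naive two-step form overflows exp...
x_big = np.array([[1000.0, 1001.0]])
with np.errstate(over="ignore", divide="ignore"):
    naive_big = np.log(np.exp(x_big) / np.exp(x_big).sum(axis=-1, keepdims=True))
# ...while the fused form remains finite.
fused_big = log_softmax(x_big)
```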