tpp-mlir
TPP experimentation on MLIR for linear algebra
The PyTorch models we have in the benchmarks get a left-over `xsmm.zero` for the entire (unpacked) input in addition to the one inside the loop (that gets converted to beta=0...
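For context, a purely illustrative scalar sketch of why the extra zero is dead work once the inner one becomes beta=0 (this is not the actual XSMM call, just the GEMM semantics):

```cpp
#include <cstdint>

// Illustrative GEMM tile: with beta == 0 the kernel overwrites C completely,
// so a separate zero-initialization of C beforehand (the left-over zero) is
// redundant work.
void gemmTile(const float *A, const float *B, float *C,
              int64_t M, int64_t N, int64_t K, float beta) {
  for (int64_t i = 0; i < M; ++i)
    for (int64_t j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int64_t k = 0; k < K; ++k)
        acc += A[i * K + k] * B[k * N + j];
      // beta == 0: C is only written, never read.
      C[i * N + j] = (beta == 0.0f) ? acc : acc + beta * C[i * N + j];
    }
}
```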
Ideas for speeding up BF16 on SPR using AMX:

## XSMM level fusion

This is important for FP32 but more so for BF16/VNNI on AMX. This has been described in...
This will improve our benchmark strategy and should be a good chunk of work that we can upstream. We also need to track them somehow. Today we have `mean` and...
These are the known issues to reach libxsmm-dnn performance on "pre-packed layer" MLPs:
- [x] Beta=Zero (see #777, #784)
- [x] XSMM fusion (see #752)
- [ ] Allocation on...
Right now there's only an `f32` variant of print, not a `bf16` one, so we had to use the vector lowering. But if we upstream a `printMemrefBF16` (see #554), then we can just...
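A minimal sketch of what such a runtime helper could look like, assuming raw 16-bit bf16 storage and a plain strided 1-D memref descriptor; the real version would follow the `printMemrefF32` conventions in MLIR's RunnerUtils, and the name and signature below are only placeholders:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Raw bf16 bit pattern; widened to float only for printing.
using bf16_bits = uint16_t;

static float bf16ToFloat(bf16_bits v) {
  // bf16 is the upper 16 bits of an IEEE-754 float32.
  uint32_t bits = static_cast<uint32_t>(v) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Minimal 1-D strided memref descriptor (mirrors MLIR's StridedMemRefType<T, 1>).
struct MemRefBF16_1D {
  bf16_bits *allocated;
  bf16_bits *aligned;
  int64_t offset;
  int64_t sizes[1];
  int64_t strides[1];
};

// Hypothetical entry point; a real printMemrefBF16 would handle arbitrary rank
// the way printMemrefF32 does in mlir/ExecutionEngine/RunnerUtils.cpp.
extern "C" void printMemrefBF16_1d(MemRefBF16_1D *m) {
  std::printf("[");
  for (int64_t i = 0; i < m->sizes[0]; ++i) {
    bf16_bits v = m->aligned[m->offset + i * m->strides[0]];
    std::printf("%s%g", i ? ", " : "", bf16ToFloat(v));
  }
  std::printf("]\n");
}
```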
As with the `tpp-opt` pass modularization (#280), the second half of the lowering pipeline, which lives in `tpp-run` and runs after the default TPP pipeline, should be cleaned up, split into sub-passes,...
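A rough sketch of how a split-out stage could be exposed as a named pipeline; the pipeline name and the passes inside it are placeholders, not the actual `tpp-run` lowering:

```cpp
#include "mlir/Pass/PassManager.h"
#include "mlir/Pass/PassRegistry.h"
#include "mlir/Transforms/Passes.h"

// Register the post-TPP lowering as its own named pipeline so it can be run
// and tested in isolation from the default TPP pipeline.
void registerTppRunLoweringPipeline() {
  mlir::PassPipelineRegistration<>(
      "tpp-run-lowering",
      "The lowering tpp-run performs after the default TPP pipeline",
      [](mlir::OpPassManager &pm) {
        // Placeholder passes; the real pipeline would chain the split-out
        // bufferization / LLVM-conversion sub-passes here.
        pm.addPass(mlir::createCanonicalizerPass());
        pm.addPass(mlir::createCSEPass());
      });
}
```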
When rewriting a batch matmul to a matmul, we tile fully along the batch dimension. However, when the tensors are fully dynamic, the `scf.forall` parallelization introduces an affine min map that...
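For reference, the affine min map plays the role of the tail clamp one would write by hand when the extent is not known to divide the tile size; a scalar analogue (names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>

// Scalar analogue of a tiled loop over a dynamic extent: each iteration clamps
// its tile to the remaining work, which is what the affine.min map inside the
// scf.forall expresses.
void forEachTile(int64_t ub, int64_t tileSize) {
  for (int64_t iv = 0; iv < ub; iv += tileSize) {
    int64_t thisTile = std::min(tileSize, ub - iv); // the affine.min
    (void)thisTile; // ...process the slice [iv, iv + thisTile) here...
  }
}
```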
Today we're working on type packing for VNNI with the operation `tpp.vnni_pack`. But this isn't the only kind of packing we may want, and they're all very similar, so...
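As a point of reference, a scalar sketch of what VNNI packing does for bf16, assuming the common VNNI-2 layout B[K][N] -> Bp[K/2][N][2] and, for brevity, an even K; the op itself would of course be generated, not hand-written:

```cpp
#include <cstdint>
#include <vector>

// Pack a row-major bf16 matrix B[K][N] into VNNI-2 layout Bp[K/2][N][2] so
// that pairs of K-adjacent elements become contiguous, as AMX-style bf16 FMAs
// expect. Assumes K is even for brevity.
std::vector<uint16_t> vnni2Pack(const std::vector<uint16_t> &B,
                                int64_t K, int64_t N) {
  std::vector<uint16_t> packed(B.size());
  for (int64_t k = 0; k < K; ++k)
    for (int64_t n = 0; n < N; ++n)
      packed[(k / 2) * N * 2 + n * 2 + (k % 2)] = B[k * N + n];
  return packed;
}
```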
Since #565 we have the ability to use libxsmm calls in the compiler. We're working on lowering `tensor.pack` into `tpp.copy` calls in a loop (#290), but the compile-time implementation (#336)...
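For intuition, a scalar sketch of a pack expressed as a loop of per-tile copies, which is roughly the shape of the copies-in-a-loop lowering; the element type, block sizes, and layout here are placeholders, and tails/padding are ignored:

```cpp
#include <cstdint>
#include <cstring>

// Pack a row-major A[M][N] into blocked layout Ap[M/BM][N/BN][BM][BN] as a
// loop over tiles, copying each tile row by row. Assumes M % BM == 0 and
// N % BN == 0 for brevity; a real lowering also handles padding and tails.
void packTiles(const float *A, float *Ap, int64_t M, int64_t N,
               int64_t BM, int64_t BN) {
  for (int64_t mb = 0; mb < M / BM; ++mb)
    for (int64_t nb = 0; nb < N / BN; ++nb) {
      float *tile = Ap + (mb * (N / BN) + nb) * BM * BN;
      for (int64_t i = 0; i < BM; ++i)
        // One contiguous row of the tile: the per-tile "copy" step.
        std::memcpy(tile + i * BN,
                    A + (mb * BM + i) * N + nb * BN,
                    sizeof(float) * BN);
    }
}
```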