tpp-mlir
TPP experimentation on MLIR for linear algebra
The PyTorch models we have in the benchmarks get a left-over `xsmm.zero` for the entire (unpacked) input in addition to the one inside the loop (that gets converted to beta=0...
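For context, a purely illustrative scalar sketch of why the extra zero is dead work once the inner one becomes beta=0 (this is not the actual XSMM call, just the GEMM semantics):

```cpp
#include <cstdint>

// Illustrative GEMM tile: with beta == 0 the kernel overwrites C completely,
// so a separate zero-initialization of C beforehand (the left-over zero) is
// redundant work.
void gemmTile(const float *A, const float *B, float *C,
              int64_t M, int64_t N, int64_t K, float beta) {
  for (int64_t i = 0; i < M; ++i)
    for (int64_t j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int64_t k = 0; k < K; ++k)
        acc += A[i * K + k] * B[k * N + j];
      // beta == 0: C is only written, never read.
      C[i * N + j] = (beta == 0.0f) ? acc : acc + beta * C[i * N + j];
    }
}
```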
Ideas for speeding up BF16 on SPR using AMX:

## XSMM level fusion

This is important for FP32 but more so for BF16/VNNI on AMX. This has been described in...
This will improve our benchmark strategy and should be a good chunk of work that we can upstream. We also need to track them somehow. Today we have `mean` and...
These are the known issues to reach libxsmm-dnn performance on "pre-packed layer" MLPs:
- [x] Beta=Zero (see #777, #784)
- [x] XSMM fusion (see #752)
- [ ] Allocation on...
Right now there's only an `f32` variant of print, not a `bf16` one, so we had to use the vector lowering. But if we upstream a `printMemrefBF16` (see #554), then we can just...
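A minimal sketch of what such a runtime helper could look like, assuming raw 16-bit bf16 storage and a plain strided 1-D memref descriptor; the real version would follow the `printMemrefF32` conventions in MLIR's RunnerUtils, and the name and signature below are only placeholders:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Raw bf16 bit pattern; widened to float only for printing.
using bf16_bits = uint16_t;

static float bf16ToFloat(bf16_bits v) {
  // bf16 is the upper 16 bits of an IEEE-754 float32.
  uint32_t bits = static_cast<uint32_t>(v) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Minimal 1-D strided memref descriptor (mirrors MLIR's StridedMemRefType<T, 1>).
struct MemRefBF16_1D {
  bf16_bits *allocated;
  bf16_bits *aligned;
  int64_t offset;
  int64_t sizes[1];
  int64_t strides[1];
};

// Hypothetical entry point; a real printMemrefBF16 would handle arbitrary rank
// the way printMemrefF32 does in mlir/ExecutionEngine/RunnerUtils.cpp.
extern "C" void printMemrefBF16_1d(MemRefBF16_1D *m) {
  std::printf("[");
  for (int64_t i = 0; i < m->sizes[0]; ++i) {
    bf16_bits v = m->aligned[m->offset + i * m->strides[0]];
    std::printf("%s%g", i ? ", " : "", bf16ToFloat(v));
  }
  std::printf("]\n");
}
```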
As with the `tpp-opt` pass modularization (#280), the second half of the lowering pipeline, which lives in `tpp-run` and runs after the default TPP pipeline, should be cleaned up, split into sub-passes,...
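A rough sketch of how a split-out stage could be exposed as a named pipeline; the pipeline name and the passes inside it are placeholders, not the actual `tpp-run` lowering:

```cpp
#include "mlir/Pass/PassManager.h"
#include "mlir/Pass/PassRegistry.h"
#include "mlir/Transforms/Passes.h"

// Register the post-TPP lowering as its own named pipeline so it can be run
// and tested in isolation from the default TPP pipeline.
void registerTppRunLoweringPipeline() {
  mlir::PassPipelineRegistration<>(
      "tpp-run-lowering",
      "The lowering tpp-run performs after the default TPP pipeline",
      [](mlir::OpPassManager &pm) {
        // Placeholder passes; the real pipeline would chain the split-out
        // bufferization / LLVM-conversion sub-passes here.
        pm.addPass(mlir::createCanonicalizerPass());
        pm.addPass(mlir::createCSEPass());
      });
}
```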
When rewriting a batch matmul to a matmul, we tile fully along the batch dimension. However, when the tensors are fully dynamic, the `scf.forall` parallelization introduces an affine min map that...
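For reference, the affine min map plays the role of the tail clamp one would write by hand when the extent is not known to divide the tile size; a scalar analogue (names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>

// Scalar analogue of a tiled loop over a dynamic extent: each iteration clamps
// its tile to the remaining work, which is what the affine.min map inside the
// scf.forall expresses.
void forEachTile(int64_t ub, int64_t tileSize) {
  for (int64_t iv = 0; iv < ub; iv += tileSize) {
    int64_t thisTile = std::min(tileSize, ub - iv); // the affine.min
    (void)thisTile; // ...process the slice [iv, iv + thisTile) here...
  }
}
```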
Today we're working on type packing for VNNI with the operation `tpp.vnni_pack`. But this isn't the only kind of packing we may want, and they're all very similar, so...
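As a point of reference, a scalar sketch of what VNNI packing does for bf16, assuming the common VNNI-2 layout B[K][N] -> Bp[K/2][N][2] and, for brevity, an even K; the op itself would of course be generated, not hand-written:

```cpp
#include <cstdint>
#include <vector>

// Pack a row-major bf16 matrix B[K][N] into VNNI-2 layout Bp[K/2][N][2] so
// that pairs of K-adjacent elements become contiguous, as AMX-style bf16 FMAs
// expect. Assumes K is even for brevity.
std::vector<uint16_t> vnni2Pack(const std::vector<uint16_t> &B,
                                int64_t K, int64_t N) {
  std::vector<uint16_t> packed(B.size());
  for (int64_t k = 0; k < K; ++k)
    for (int64_t n = 0; n < N; ++n)
      packed[(k / 2) * N * 2 + n * 2 + (k % 2)] = B[k * N + n];
  return packed;
}
```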
Since #565 we have the ability to use libxsmm calls in the compiler. We're working on lowering `tensor.pack` into `tpp.copy` calls in a loop (#290), but the compile-time implementation (#336)...
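For intuition, a scalar sketch of a pack expressed as a loop of per-tile copies, which is roughly the shape of the copies-in-a-loop lowering; the element type, block sizes, and layout here are placeholders, and tails/padding are ignored:

```cpp
#include <cstdint>
#include <cstring>

// Pack a row-major A[M][N] into blocked layout Ap[M/BM][N/BN][BM][BN] as a
// loop over tiles, copying each tile row by row. Assumes M % BM == 0 and
// N % BN == 0 for brevity; a real lowering also handles padding and tails.
void packTiles(const float *A, float *Ap, int64_t M, int64_t N,
               int64_t BM, int64_t BN) {
  for (int64_t mb = 0; mb < M / BM; ++mb)
    for (int64_t nb = 0; nb < N / BN; ++nb) {
      float *tile = Ap + (mb * (N / BN) + nb) * BM * BN;
      for (int64_t i = 0; i < BM; ++i)
        // One contiguous row of the tile: the per-tile "copy" step.
        std::memcpy(tile + i * BN,
                    A + (mb * BM + i) * N + nb * BN,
                    sizeof(float) * BN);
    }
}
```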