Renato Golin
Ideas for speeding up BF16 on SPR using AMX:

## XSMM level fusion

This is important for FP32 but more so for BF16/VNNI on AMX. This has been described in...
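To make the idea concrete, here is a plain C++ sketch of what fusing at the microkernel level means: the bias add and ReLU are applied while each output element is produced, instead of in a second full pass over C. This is illustrative only; the function name `matmulBiasReluFused` and the bias+ReLU epilogue are assumptions here, not libxsmm's fused-call API.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative fused epilogue: bias and ReLU happen while the output is
// still hot, instead of a separate pass that re-reads the whole C buffer.
void matmulBiasReluFused(const float *a, const float *b, const float *bias,
                         float *c, int64_t M, int64_t N, int64_t K) {
  for (int64_t m = 0; m < M; ++m)
    for (int64_t n = 0; n < N; ++n) {
      float acc = bias[n];                  // start from the bias
      for (int64_t k = 0; k < K; ++k)
        acc += a[m * K + k] * b[k * N + n];
      c[m * N + n] = std::max(acc, 0.0f);   // ReLU applied in the same pass
    }
}
```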
This will improve our benchmark strategy and should be a good chunk of work that we can upstream. We also need to track them somehow. Today we have `mean` and...
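As a rough sketch of the per-run statistics such tracking could record, assuming a hypothetical `summarize` helper and standard deviation as the second metric (an assumption, since the excerpt is cut off):

```cpp
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// Per-run statistics a benchmark tracker could record. Assumes a
// non-empty set of per-iteration timings.
struct Stats { double mean; double stdev; };

Stats summarize(const std::vector<double> &samples) {
  double mean = std::accumulate(samples.begin(), samples.end(), 0.0) /
                static_cast<double>(samples.size());
  double sq = 0.0;
  for (double s : samples)
    sq += (s - mean) * (s - mean);
  return {mean, std::sqrt(sq / static_cast<double>(samples.size()))};
}

int main() {
  std::vector<double> timingsMs = {1.02, 0.98, 1.01, 0.99};
  Stats st = summarize(timingsMs);
  std::printf("mean: %.3f ms, stdev: %.3f ms\n", st.mean, st.stdev);
  return 0;
}
```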
These are the known issues to reach libxsmm-dnn performance on "pre-packed layer" MLPs:
- [x] Beta=Zero (see #777, #784)
- [x] XSMM fusion (see #752)
- [ ] Allocation on...
Right now there's only an `f32` variant of print, not a `bf16` one, so we had to use the vector lowering. But if we upstream a `printMemrefBF16` (see #554), then we can just...
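A minimal sketch of what the bf16-specific part of such a printer could look like, kept self-contained here. The name `printMemrefBF16Flat` and the flat-array interface are assumptions; the real RunnerUtils entry points walk an unranked memref descriptor.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// bf16 has no native C++ type, so values are stored as uint16_t and
// widened to float before printing. The conversion is exact: bf16 is the
// upper 16 bits of an IEEE-754 float.
static float bf16ToFloat(uint16_t v) {
  uint32_t bits = static_cast<uint32_t>(v) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Hypothetical flat printer; an upstream version would use the memref
// descriptor's rank, sizes and strides instead of a plain array.
extern "C" void printMemrefBF16Flat(const uint16_t *data, int64_t numElements) {
  std::printf("[");
  for (int64_t i = 0; i < numElements; ++i)
    std::printf("%s%g", i ? ", " : "", bf16ToFloat(data[i]));
  std::printf("]\n");
}
```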
Today we're working on type packing for VNNI with the operation `tpp.vnni_pack`. But this isn't the only kind of packing we may want, and they're all very similar, so...
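For reference, the VNNI-2 layout for bf16 repacks a K x N operand into (K/2) x N x 2 so that pairs of K elements are contiguous, which is what the AMX/AVX-512 bf16 dot-product instructions consume. A plain C++ sketch of that transform (the `vnniPack` name, row-major layout and even K are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Repack a row-major K x N bf16 matrix (stored as uint16_t) into the
// VNNI-2 layout (K/2) x N x 2: two consecutive K elements per [n] slot.
std::vector<uint16_t> vnniPack(const std::vector<uint16_t> &b,
                               int64_t K, int64_t N) {
  std::vector<uint16_t> packed(static_cast<size_t>(K) * N);
  for (int64_t k = 0; k < K; k += 2)
    for (int64_t n = 0; n < N; ++n)
      for (int64_t v = 0; v < 2; ++v)
        packed[((k / 2) * N + n) * 2 + v] = b[(k + v) * N + n];
  return packed;
}
```

The same scheme with a factor of 4 covers int8, which is why a single, more generic packing operation is attractive.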
Since #565 we have the ability to use libxsmm calls in the compiler. We're working on lowering `tensor.pack` into `tpp.copy` calls in a loop (#290) but the compile-time implementation (#336)...
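A rough runtime analogue of that lowering, purely illustrative: a blocked pack expressed as a loop whose body is a per-tile copy, the way a loop around a `tpp.copy` call would look after bufferization. The `packIntoTiles` name and evenly-dividing tile sizes are assumptions.

```cpp
#include <cstdint>
#include <cstring>

// Pack a row-major N x K matrix into contiguous NB x KB tiles. Each
// iteration of the outer loops handles exactly one tile, i.e. one copy
// call; the inner loop copies the tile row by row.
void packIntoTiles(const float *src, float *dst, int64_t N, int64_t K,
                   int64_t NB, int64_t KB) {
  int64_t tilesK = K / KB;
  for (int64_t nb = 0; nb < N / NB; ++nb)
    for (int64_t kb = 0; kb < tilesK; ++kb) {
      float *tile = dst + (nb * tilesK + kb) * NB * KB;
      for (int64_t r = 0; r < NB; ++r)
        std::memcpy(tile + r * KB, src + (nb * NB + r) * K + kb * KB,
                    static_cast<size_t>(KB) * sizeof(float));
    }
}
```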
[RunnerUtils.cpp](https://github.com/llvm/llvm-project/blob/main/mlir/lib/ExecutionEngine/RunnerUtils.cpp#L212) already has verifiers that we can use for equality. Instead of adding a new dialect, I think we just need a local utility builder that can lower to the...
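The check itself is simple; something along these lines is all the utility builder would need to emit a call to (the `verifyAllClose` name and the absolute-tolerance policy are assumptions, not the RunnerUtils signature):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Element-wise comparison of a computed buffer against a reference within
// an absolute tolerance. Reports the first mismatch and returns the count.
int64_t verifyAllClose(const float *actual, const float *expected,
                       int64_t numElements, float absTol) {
  int64_t mismatches = 0;
  for (int64_t i = 0; i < numElements; ++i) {
    if (std::fabs(actual[i] - expected[i]) > absTol) {
      if (mismatches == 0)
        std::fprintf(stderr, "first mismatch at %lld: %f vs %f\n",
                     static_cast<long long>(i), actual[i], expected[i]);
      ++mismatches;
    }
  }
  return mismatches;
}
```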
The current [implementation](https://github.com/plaidml/tpp-mlir/blob/main/tpp-run/MLIRBench.cpp#L97) replaces dense tensors with random values, but this is restricted to `tpp-run`. For `tpp-opt` tests we can't use that, so we end up using dense tensors, and...
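For reference, a sketch of the kind of initialization `tpp-run` does: a fixed-seed pseudo-random fill, so values are not a constant splat but runs stay reproducible. The `randomInit` name, seed and range are arbitrary choices here, not what MLIRBench uses.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Fixed-seed random fill: avoids the constant-folding-friendly splat
// values of a dense constant while keeping results reproducible.
std::vector<float> randomInit(int64_t numElements, uint32_t seed = 0) {
  std::mt19937 rng(seed);
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);
  std::vector<float> data(static_cast<size_t>(numElements));
  for (float &v : data)
    v = dist(rng);
  return data;
}
```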
As described in #492, the TPP matchers all have asserts to make sure the number of ops is correct. This is a problem because:
1. It is poor software engineering practice...
The pass `conv-simplify` moves the bias add into the tensor initialization of a convolution when that initialization is a zero splat. This is common in matmul networks too, so we should make that...
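In matmul terms the rewrite is the following equivalence, sketched in plain C++ (the `matmulWithBiasInit` name and row-major layout are assumptions): when the accumulator starts from a zero splat, the separate bias add can instead become the accumulator's initial value.

```cpp
#include <cstdint>

// "C = A*B starting from zero, then C += bias" is the same as initializing
// C with the broadcast bias and letting the matmul accumulate on top of it,
// so the separate bias-add pass disappears.
void matmulWithBiasInit(const float *a, const float *b, const float *bias,
                        float *c, int64_t M, int64_t N, int64_t K) {
  for (int64_t m = 0; m < M; ++m)
    for (int64_t n = 0; n < N; ++n)
      c[m * N + n] = bias[n];              // init = broadcast bias, not zero
  for (int64_t m = 0; m < M; ++m)
    for (int64_t k = 0; k < K; ++k)
      for (int64_t n = 0; n < N; ++n)
        c[m * N + n] += a[m * K + k] * b[k * N + n];
}
```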