Renato Golin
Ideas for speeding up BF16 on SPR using AMX:

## XSMM level fusion

This is important for FP32 but more so for BF16/VNNI on AMX. This has been described in...
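To make the idea concrete, here is a plain C++ sketch of what fusing at the microkernel level means: the bias add and ReLU are applied while each output element is produced, instead of in a second full pass over C. This is illustrative only; the function name `matmulBiasReluFused` and the bias+ReLU epilogue are assumptions here, not libxsmm's fused-call API.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative fused epilogue: bias and ReLU happen while the output is
// still hot, instead of a separate pass that re-reads the whole C buffer.
void matmulBiasReluFused(const float *a, const float *b, const float *bias,
                         float *c, int64_t M, int64_t N, int64_t K) {
  for (int64_t m = 0; m < M; ++m)
    for (int64_t n = 0; n < N; ++n) {
      float acc = bias[n];                  // start from the bias
      for (int64_t k = 0; k < K; ++k)
        acc += a[m * K + k] * b[k * N + n];
      c[m * N + n] = std::max(acc, 0.0f);   // ReLU applied in the same pass
    }
}
```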
This will improve our benchmark strategy and should be a good chunk of work that we can upstream. We also need to track them somehow. Today we have `mean` and...
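As a rough sketch of the per-run statistics such tracking could record, assuming a hypothetical `summarize` helper and standard deviation as the second metric (an assumption, since the excerpt is cut off):

```cpp
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// Per-run statistics a benchmark tracker could record. Assumes a
// non-empty set of per-iteration timings.
struct Stats { double mean; double stdev; };

Stats summarize(const std::vector<double> &samples) {
  double mean = std::accumulate(samples.begin(), samples.end(), 0.0) /
                static_cast<double>(samples.size());
  double sq = 0.0;
  for (double s : samples)
    sq += (s - mean) * (s - mean);
  return {mean, std::sqrt(sq / static_cast<double>(samples.size()))};
}

int main() {
  std::vector<double> timingsMs = {1.02, 0.98, 1.01, 0.99};
  Stats st = summarize(timingsMs);
  std::printf("mean: %.3f ms, stdev: %.3f ms\n", st.mean, st.stdev);
  return 0;
}
```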
These are the known issues to reach libxsmm-dnn performance on "pre-packed layer" MLPs:
- [x] Beta=Zero (see #777, #784)
- [x] XSMM fusion (see #752)
- [ ] Allocation on...
Right now there's only an `f32` variant of print, not a `bf16` one, so we had to use the vector lowering. But if we upstream a `printMemrefBF16` (see #554), then we can just...
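A minimal sketch of what the bf16-specific part of such a printer could look like, kept self-contained here. The name `printMemrefBF16Flat` and the flat-array interface are assumptions; the real RunnerUtils entry points walk an unranked memref descriptor.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// bf16 has no native C++ type, so values are stored as uint16_t and
// widened to float before printing. The conversion is exact: bf16 is the
// upper 16 bits of an IEEE-754 float.
static float bf16ToFloat(uint16_t v) {
  uint32_t bits = static_cast<uint32_t>(v) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Hypothetical flat printer; an upstream version would use the memref
// descriptor's rank, sizes and strides instead of a plain array.
extern "C" void printMemrefBF16Flat(const uint16_t *data, int64_t numElements) {
  std::printf("[");
  for (int64_t i = 0; i < numElements; ++i)
    std::printf("%s%g", i ? ", " : "", bf16ToFloat(data[i]));
  std::printf("]\n");
}
```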
Today we're working on type packing for VNNI with the operation `tpp.vnni_pack`. But this isn't the only kind of packing we may want, and they're all very similar, so...
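For reference, the VNNI-2 layout for bf16 repacks a K x N operand into (K/2) x N x 2 so that pairs of K elements are contiguous, which is what the AMX/AVX-512 bf16 dot-product instructions consume. A plain C++ sketch of that transform (the `vnniPack` name, row-major layout and even K are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Repack a row-major K x N bf16 matrix (stored as uint16_t) into the
// VNNI-2 layout (K/2) x N x 2: two consecutive K elements per [n] slot.
std::vector<uint16_t> vnniPack(const std::vector<uint16_t> &b,
                               int64_t K, int64_t N) {
  std::vector<uint16_t> packed(static_cast<size_t>(K) * N);
  for (int64_t k = 0; k < K; k += 2)
    for (int64_t n = 0; n < N; ++n)
      for (int64_t v = 0; v < 2; ++v)
        packed[((k / 2) * N + n) * 2 + v] = b[(k + v) * N + n];
  return packed;
}
```

The same scheme with a factor of 4 covers int8, which is why a single, more generic packing operation is attractive.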
Since #565 we have the ability to use libxsmm calls in the compiler. We're working on lowering `tensor.pack` into `tpp.copy` calls in a loop (#290) but the compile-time implementation (#336)...
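A rough runtime analogue of that lowering, purely illustrative: a blocked pack expressed as a loop whose body is a per-tile copy, the way a loop around a `tpp.copy` call would look after bufferization. The `packIntoTiles` name and evenly-dividing tile sizes are assumptions.

```cpp
#include <cstdint>
#include <cstring>

// Pack a row-major N x K matrix into contiguous NB x KB tiles. Each
// iteration of the outer loops handles exactly one tile, i.e. one copy
// call; the inner loop copies the tile row by row.
void packIntoTiles(const float *src, float *dst, int64_t N, int64_t K,
                   int64_t NB, int64_t KB) {
  int64_t tilesK = K / KB;
  for (int64_t nb = 0; nb < N / NB; ++nb)
    for (int64_t kb = 0; kb < tilesK; ++kb) {
      float *tile = dst + (nb * tilesK + kb) * NB * KB;
      for (int64_t r = 0; r < NB; ++r)
        std::memcpy(tile + r * KB, src + (nb * NB + r) * K + kb * KB,
                    static_cast<size_t>(KB) * sizeof(float));
    }
}
```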
[RunnerUtils.cpp](https://github.com/llvm/llvm-project/blob/main/mlir/lib/ExecutionEngine/RunnerUtils.cpp#L212) already has verifiers that we can use for equality. Instead of adding a new dialect, I think we just need a local utility builder that can lower to the...
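The check itself is simple; something along these lines is all the utility builder would need to emit a call to (the `verifyAllClose` name and the absolute-tolerance policy are assumptions, not the RunnerUtils signature):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Element-wise comparison of a computed buffer against a reference within
// an absolute tolerance. Reports the first mismatch and returns the count.
int64_t verifyAllClose(const float *actual, const float *expected,
                       int64_t numElements, float absTol) {
  int64_t mismatches = 0;
  for (int64_t i = 0; i < numElements; ++i) {
    if (std::fabs(actual[i] - expected[i]) > absTol) {
      if (mismatches == 0)
        std::fprintf(stderr, "first mismatch at %lld: %f vs %f\n",
                     static_cast<long long>(i), actual[i], expected[i]);
      ++mismatches;
    }
  }
  return mismatches;
}
```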
The current [implementation](https://github.com/plaidml/tpp-mlir/blob/main/tpp-run/MLIRBench.cpp#L97) replaces dense tensors with random values, but this is restricted to `tpp-run`. For `tpp-opt` tests we can't use that, so we end up using dense tensors, and...
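For reference, a sketch of the kind of initialization `tpp-run` does: a fixed-seed pseudo-random fill, so values are not a constant splat but runs stay reproducible. The `randomInit` name, seed and range are arbitrary choices here, not what MLIRBench uses.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Fixed-seed random fill: avoids the constant-folding-friendly splat
// values of a dense constant while keeping results reproducible.
std::vector<float> randomInit(int64_t numElements, uint32_t seed = 0) {
  std::mt19937 rng(seed);
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);
  std::vector<float> data(static_cast<size_t>(numElements));
  for (float &v : data)
    v = dist(rng);
  return data;
}
```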
As described in #492, the TPP matchers all have asserts to make sure the number of ops is correct. This is a problem because:
1. It is poor software engineering practice...
The pass `conv-simplify` moves the bias add into the tensor initialization of a convolution when that initialization is a zero splat. This is common in matmul networks too, so we should make that...
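In matmul terms the rewrite is the following equivalence, sketched in plain C++ (the `matmulWithBiasInit` name and row-major layout are assumptions): when the accumulator starts from a zero splat, the separate bias add can instead become the accumulator's initial value.

```cpp
#include <cstdint>

// "C = A*B starting from zero, then C += bias" is the same as initializing
// C with the broadcast bias and letting the matmul accumulate on top of it,
// so the separate bias-add pass disappears.
void matmulWithBiasInit(const float *a, const float *b, const float *bias,
                        float *c, int64_t M, int64_t N, int64_t K) {
  for (int64_t m = 0; m < M; ++m)
    for (int64_t n = 0; n < N; ++n)
      c[m * N + n] = bias[n];              // init = broadcast bias, not zero
  for (int64_t m = 0; m < M; ++m)
    for (int64_t k = 0; k < K; ++k)
      for (int64_t n = 0; n < N; ++n)
        c[m * N + n] += a[m * K + k] * b[k * N + n];
}
```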