Benoit Jacob issues

Results 54 issues of


                                            Benoit Jacob

Generalize `QuantizedMatmulToMatmul` to handle `linalg.quantized_batch_matmul`.

# Context Consider the following 4 linalg named ops and how they differ from each other, by comparing their definitions in the Linalg OpDSL (follow the links): 1. [matmul](https://github.com/llvm/llvm-project/blob/680c780a367bfe1c0cdf786250fd7f565ef6d23d/mlir/python/mlir/dialects/linalg/opdsl/ops/core_named_ops.py#L242-L256) 2....

compiler/dialects

performance ⚡

LowerToUKernels should create a codegen fallback when the ukernel would be slow.

Ukernels do not expose whether they have fast code to handle a given case. They always succeed, potentially falling back on a slow generic tile function that can be 100x...

codegen

performance ⚡

codegen/llvm

Update tracy docs regarding `--iree-hal-dump-executable-sources-to=`

As pointed out by @KoolJBlack , it is now necessary to pass `--iree-hal-dump-executable-sources-to=` for Tracy sampling to work. That is because Tracy sampling relies on the presence of source files,...

Enable the `mmt4d` ukernel by default in `llvm-cpu`

The proposal (PR #15586) is to change the default value of `iree-llvmcpu-enable-ukernels` from `none` to `mmt4d`. Note that the `mmt4d` ukernel is the only ukernel used at all on `llvm-cpu`...

performance ⚡

codegen/llvm

Optimize narrow-M `mmt4d` ukernel tile functions

We have `mmt4d` ukernel tile functions for a bunch of narrow-M cases, but they have been added as naive truncations of the general case. Often, that's fine. Sometimes, that results...

performance ⚡

codegen/llvm

Transpose narrow matmuls so that the narrow dimension is M

Narrow-N and narrow-M cases of matmuls are entirely similar. There is no need to write data-tiling logic and ukernels for all combinations of the two narrow dimensions. We have standardized...

performance ⚡

codegen/llvm

Benoit Jacob

Generalize `QuantizedMatmulToMatmul` to handle `linalg.quantized_batch_matmul`.

LowerToUKernels should create a codegen fallback when the ukernel would be slow.

Update tracy docs regarding `--iree-hal-dump-executable-sources-to=`

Enable the `mmt4d` ukernel by default in `llvm-cpu`

Optimize narrow-M `mmt4d` ukernel tile functions

Transpose narrow matmuls so that the narrow dimension is M

8bit*8bit 4-D dot-product accumulating to 32bit, similar to ARM SDOT and x86 VNNI

Drop tracy from CI benchmarks?

Test a LLVM integrate with https://github.com/llvm/llvm-project/pull/91800

Add a lowering of `vector.interleave` to `vector.shuffle`