Enable the `mmt4d` ukernel by default in `llvm-cpu`
The proposal (PR #15586) is to change the default value of `iree-llvmcpu-enable-ukernels` from `none` to `mmt4d`. Note that the `mmt4d` ukernel is currently the only ukernel used at all on `llvm-cpu`.
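Concretely, the change means that a plain `iree-compile` invocation would behave as if the flag were passed explicitly. A sketch of the invocation (file names are placeholders, not from this issue):

```
# Today: the mmt4d ukernel is off unless requested explicitly.
iree-compile --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-enable-ukernels=mmt4d \
  model.mlir -o model.vmfb

# After the proposed change, omitting the flag would be equivalent to
# --iree-llvmcpu-enable-ukernels=mmt4d; passing `none` opts back out.
```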
### Prerequisites
- [ ] #15784.
- [ ] #16399.
- [ ] #16400.
- [ ] #16198.
Those are probably too aggressive as prerequisites: they are all feature additions. The actual prerequisites would be:
- Turn it on by default and check for crashes.
- Turn it on by default and check for regressions.

Everything else we "fix forward".
On several models, we have seen that when DT+UK is enabled and microkernels are not available, it ends up being 2-3x slower than DT on its own. This is still occurring after https://github.com/openxla/iree/pull/15883. Is it possible to ensure that DT+UK is never going to be slower than DT alone?
When we fall back to codegen, are we still somehow using the tile sizes from the materialize encoding pass?
> On several models, we have seen that when DT+UK is enabled and microkernels are not available, it ends up being 2-3x slower than DT on its own. This is still occurring after #15883. Is it possible to ensure that DT+UK is never going to be slower than DT alone?
There is not going to be a perfect system... The only way to ensure we aren't going to go down a potentially slow path is to track the numbers and fix regressions, AFAICS. So please do report any slowdowns you see from turning microkernels for mmt4d on by default, and we can get an idea of what the gaps are.
> On several models, we have seen that when DT+UK is enabled and microkernels are not available, it ends up being 2-3x slower than DT on its own. This is still occurring after #15883. Is it possible to ensure that DT+UK is never going to be slower than DT alone?
I agree we shouldn't flip the switch with 2x-3x regressions. The good news is that a regression this big has got to have a simple explanation. Reading back, I remember better now: #15883 was not meant as a complete fix to #15784 (at least, it isn't one), it only tackles certain cases where we can tell at compile time that we don't have a ukernel for a mmt4d op based on element types. We need to fix the main part of #15883 now.
> When we fall back to codegen, are we still somehow using the tile sizes from the materialize encoding pass?
Yes, but in almost all cases, these tile sizes are the same regardless of whether ukernels are enabled, so this could rarely explain an actual regression.
Yes, I agree with Benoit. I think we should make sure that the fallback to codegen is happening as expected; that should cover most of the concerns. We also have a table in CI called "Data-Tiling Comparison Table" that compares DT vs. DT+UK, so we may want to look at the outliers there. For example, from the CI run in https://github.com/openxla/iree/pull/16394:
| DT/DT+UK | Benchmark | DT (ms) | | DT+UK (ms) |
|---|---|---|---|---|
| 0.6X | MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.559 (1.0X) | N/A | 60.212 (0.6X) |
| 0.7X | MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.682 (1.0X) | N/A | 5.411 (0.7X) |
I wonder if they all are due to the same issue.
> > When we fall back to codegen, are we still somehow using the tile sizes from the materialize encoding pass?
>
> Yes, but in almost all cases, these tile sizes are the same regardless of whether ukernels are enabled, so this could rarely explain an actual regression.
If we end up using K>1 for mmt4d, codegen won't generate great code right now. We should make sure that if there is no ukernel available, we use the DT tile sizes.
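For context on why the inner K tile size matters: mmt4d is a matmul over operands pre-packed into 4D tiled layouts, and the inner reduction dimension is part of the tile shape. A minimal NumPy sketch of the semantics (the tile sizes below are illustrative, not the sizes materialize-encoding actually picks):

```python
import numpy as np

# Illustrative tile sizes; the real ones are chosen per-target by the
# materialize-encoding pass. K0 > 1 is the case flagged above as slow
# for pure codegen.
M0, N0, K0 = 8, 8, 2
M1, N1, K1 = 4, 3, 5  # outer (tile-count) dims

rng = np.random.default_rng(0)
lhs = rng.standard_normal((M1 * M0, K1 * K0)).astype(np.float32)
rhs = rng.standard_normal((K1 * K0, N1 * N0)).astype(np.float32)

# Pack into mmt4d layouts: LHS -> [M1, K1, M0, K0], RHS -> [N1, K1, N0, K0].
lhs4d = lhs.reshape(M1, M0, K1, K0).transpose(0, 2, 1, 3)
rhs4d = rhs.reshape(K1, K0, N1, N0).transpose(2, 0, 3, 1)

# mmt4d: accumulate [M0, K0] x [N0, K0]^T inner-tile products into
# an accumulator of shape [M1, N1, M0, N0].
acc = np.zeros((M1, N1, M0, N0), dtype=np.float32)
for m1 in range(M1):
    for n1 in range(N1):
        for k1 in range(K1):
            acc[m1, n1] += lhs4d[m1, k1] @ rhs4d[n1, k1].T

# Unpack and check against the plain row-major matmul.
result = acc.transpose(0, 2, 1, 3).reshape(M1 * M0, N1 * N0)
assert np.allclose(result, lhs @ rhs, atol=1e-4)
```

The ukernel is hand-tuned for its expected tile shapes; when codegen takes over with a K0 > 1 layout it must vectorize that inner reduction itself, which is where the gap discussed above can open up.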