Enable the `mmt4d` ukernel by default in `llvm-cpu`
The proposal (PR #15586) is to change the default value of `iree-llvmcpu-enable-ukernels` from `none` to `mmt4d`. Note that the `mmt4d` ukernel is currently the only ukernel used at all on `llvm-cpu`.
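Concretely, the change means that a plain `iree-compile` invocation would behave as if the flag were passed explicitly. A sketch of the invocation (file names are placeholders, not from this issue):

```
# Today: the mmt4d ukernel is off unless requested explicitly.
iree-compile --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-enable-ukernels=mmt4d \
  model.mlir -o model.vmfb

# After the proposed change, omitting the flag would be equivalent to
# --iree-llvmcpu-enable-ukernels=mmt4d; passing `none` opts back out.
```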
### Prerequisites
- [ ] #15784.
- [ ] #16399.
- [ ] #16400.
- [ ] #16198.
Those are probably too aggressive as prerequisites: they are all feature additions. The actual prerequisites would be:
- Turn it on by default and check for crashes.
- Turn it on by default and check for regressions.

Everything else we "fix forward".
On several models, we have seen that when DT+UK is enabled and microkernels are not available, it ends up being 2-3x slower than DT on its own. This is still occurring after https://github.com/openxla/iree/pull/15883. Is it possible to ensure that DT+UK is never going to be slower than DT alone?
When we fall back to codegen, are we still somehow using the tile sizes from the materialize encoding pass?
> On several models, we have seen that when DT+UK is enabled and microkernels are not available, it ends up being 2-3x slower than DT on its own. This is still occurring after #15883. Is it possible to ensure that DT+UK is never going to be slower than DT alone?
There is not going to be a perfect system... The only way to ensure we aren't going to go down a potentially slow path is to track the numbers and fix regressions, AFAICS. So please do report any slowdowns you see from turning microkernels for mmt4d on by default, and we can get an idea of what the gaps are.
> On several models, we have seen that when DT+UK is enabled and microkernels are not available, it ends up being 2-3x slower than DT on its own. This is still occurring after #15883. Is it possible to ensure that DT+UK is never going to be slower than DT alone?
I agree we shouldn't flip the switch with 2x-3x regressions. The good news is that a regression this big has got to have a simple explanation. Reading back, I remember better now: #15883 was not meant as a complete fix to #15784 (at least, it isn't one), it only tackles certain cases where we can tell at compile time that we don't have a ukernel for a mmt4d op based on element types. We need to fix the main part of #15883 now.
> When we fall back to codegen, are we still somehow using the tile sizes from the materialize encoding pass?
Yes, but in almost all cases, these tile sizes are the same regardless of whether ukernels are enabled, so this could rarely explain an actual regression.
Yes, I agree with Benoit. I think we should make sure that the fallback to codegen is happening as expected; that should cover most of the concerns. We also have a table in CI called "Data-Tiling Comparison Table" that compares DT vs. DT+UK, so we may want to look at the outliers there. For example, from the CI run in https://github.com/openxla/iree/pull/16394:
| DT/DT+UK | Benchmark | DT (ms) | | DT+UK (ms) |
|---|---|---|---|---|
| 0.6X | MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.559 (1.0X) | N/A | 60.212 (0.6X) |
| 0.7X | MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.682 (1.0X) | N/A | 5.411 (0.7X) |
I wonder if they all are due to the same issue.
> > When we fall back to codegen, are we still somehow using the tile sizes from the materialize encoding pass?
>
> Yes, but in almost all cases, these tile sizes are the same regardless of whether ukernels are enabled, so this could rarely explain an actual regression.
If we end up using K>1 for mmt4d, codegen won't generate great code right now. We should make sure that if there is no ukernel available, we use the DT tile sizes.
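For context on why the inner K tile size matters: mmt4d is a matmul over operands pre-packed into 4D tiled layouts, and the inner reduction dimension is part of the tile shape. A minimal NumPy sketch of the semantics (the tile sizes below are illustrative, not the sizes materialize-encoding actually picks):

```python
import numpy as np

# Illustrative tile sizes; the real ones are chosen per-target by the
# materialize-encoding pass. K0 > 1 is the case flagged above as slow
# for pure codegen.
M0, N0, K0 = 8, 8, 2
M1, N1, K1 = 4, 3, 5  # outer (tile-count) dims

rng = np.random.default_rng(0)
lhs = rng.standard_normal((M1 * M0, K1 * K0)).astype(np.float32)
rhs = rng.standard_normal((K1 * K0, N1 * N0)).astype(np.float32)

# Pack into mmt4d layouts: LHS -> [M1, K1, M0, K0], RHS -> [N1, K1, N0, K0].
lhs4d = lhs.reshape(M1, M0, K1, K0).transpose(0, 2, 1, 3)
rhs4d = rhs.reshape(K1, K0, N1, N0).transpose(2, 0, 3, 1)

# mmt4d: accumulate [M0, K0] x [N0, K0]^T inner-tile products into
# an accumulator of shape [M1, N1, M0, N0].
acc = np.zeros((M1, N1, M0, N0), dtype=np.float32)
for m1 in range(M1):
    for n1 in range(N1):
        for k1 in range(K1):
            acc[m1, n1] += lhs4d[m1, k1] @ rhs4d[n1, k1].T

# Unpack and check against the plain row-major matmul.
result = acc.transpose(0, 2, 1, 3).reshape(M1 * M0, N1 * N0)
assert np.allclose(result, lhs @ rhs, atol=1e-4)
```

The ukernel is hand-tuned for its expected tile shapes; when codegen takes over with a K0 > 1 layout it must vectorize that inner reduction itself, which is where the gap discussed above can open up.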