Optimize narrow-M `mmt4d` ukernel tile functions
We have `mmt4d` ukernel tile functions for a number of narrow-M cases, but they were added as naive truncations of the general case. Often, that's fine. Sometimes, it results in convoluted and inefficient ukernels; I'm thinking particularly of the int8 ukernels on x86-64.
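To make "naive truncation" concrete, here is a minimal scalar sketch of what an int8 `mmt4d` tile function computes and how a narrow-M variant can be obtained by shrinking `M0`. The function names, the 8x8x2 tile shape, and the signatures are all hypothetical and chosen for illustration; this is not IREE's actual ukernel API, and it does not model the SIMD code paths where the inefficiency shows up.

```c
#include <stdint.h>

/* Hypothetical reference tile function (illustration only, not IREE's API).
 * Accumulates an M0 x N0 tile of int32 from int8 inputs over K steps.
 * Assumed layouts: lhs is [K][M0][K0], rhs is [K][N0][K0] (RHS transposed),
 * acc is [M0][N0]. */
void tile_s8s8s32_ref(int32_t *acc, const int8_t *lhs, const int8_t *rhs,
                      int K, int M0, int N0, int K0) {
  for (int k = 0; k < K; ++k) {
    for (int i = 0; i < M0; ++i) {
      for (int j = 0; j < N0; ++j) {
        int32_t sum = 0;
        for (int kk = 0; kk < K0; ++kk) {
          sum += (int32_t)lhs[(k * M0 + i) * K0 + kk] *
                 (int32_t)rhs[(k * N0 + j) * K0 + kk];
        }
        acc[i * N0 + j] += sum;
      }
    }
  }
}

/* General case: assumed 8x8x2 tile shape (M0=8, N0=8, K0=2). */
void tile_8x8x2(int32_t *acc, const int8_t *lhs, const int8_t *rhs, int K) {
  tile_s8s8s32_ref(acc, lhs, rhs, K, 8, 8, 2);
}

/* Narrow-M case as a naive truncation: same loop structure, only M0 shrinks. */
void tile_1x8x2(int32_t *acc, const int8_t *lhs, const int8_t *rhs, int K) {
  tile_s8s8s32_ref(acc, lhs, rhs, K, 1, 8, 2);
}
```

In scalar code the truncation costs nothing, but a vectorized kernel whose instruction sequence was shaped around the wide tile does not necessarily stay a good fit once `M0` drops to 1, 2, or 4, which is the kind of case this issue is about.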