[CPU][ArmSME] Update tiling to use all SME accumulators
Previously, we only tiled for a single SME accumulator. This patch updates the lowering_config to make use of all SME accumulators.
This is done by increasing the tile size to [8]x[8] for f32 and to [4]x[8] for f64. These lower to four [4]x[4] 32-bit accumulators and eight [2]x[2] 64-bit accumulators, respectively.
These tile sizes need some additional vector legalization passes, which have now been added to the ArmSME pipeline.
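
For reviewers who want to check the tile arithmetic, here is a minimal sketch. It is not code from this patch: `sme_tile_usage` is a hypothetical helper, and it assumes the usual SME layout where a ZA tile for N-bit elements is [128/N]x[128/N] (scalable) and there are N/8 such tiles.

```python
def sme_tile_usage(rows, cols, elem_bits):
    """Return (base_dim, tiles_used, tiles_available) for a scalable
    [rows]x[cols] tile of elem_bits-wide elements.

    Hypothetical helper for illustration only; sizes are the scalable
    multipliers, i.e. the values at vscale = 1.
    """
    if elem_bits not in (8, 16, 32, 64, 128):
        raise ValueError("unsupported element width")
    # Assumption: one ZA tile holds [128/elem_bits] x [128/elem_bits] elements.
    base_dim = 128 // elem_bits
    # Assumption: the number of ZA tiles equals the element size in bytes.
    tiles_available = elem_bits // 8
    if rows % base_dim or cols % base_dim:
        raise ValueError("tile size is not a multiple of the base SME tile")
    tiles_used = (rows // base_dim) * (cols // base_dim)
    return base_dim, tiles_used, tiles_available


# The two tile sizes chosen in this patch.
for name, rows, cols, bits in [("f32", 8, 8, 32), ("f64", 4, 8, 64)]:
    base, used, avail = sme_tile_usage(rows, cols, bits)
    print(f"{name}: [{rows}]x[{cols}] -> {used} of {avail} "
          f"[{base}]x[{base}] accumulators")
```

Under those assumptions, both tile sizes use every available accumulator for their element type, which matches the counts given above.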
cc @c-rhodes
Note: This patch now needs an upstream fix due to #16350 (we need to legalize arith.constant ops). It should be a simple fix, but I'll have to wait until the next LLVM integration.
Edit: There are a few more issues to look into :pensive:
@hanhanW, @MaheshRavishankar if this looks okay, could we land this today? :pray:
P.S. I don't have write access.