[LLVMCPU] Adjust tile sizes of unpack op
Adjust the vector-level tile sizes of unpack ops that have an outer dims permutation. For each dimension, the tile size is propagated through the permutation, taking the max of the tile size at the original index and at the permuted index.
The hint was taken from https://gist.github.com/Max191/2d6a74f4f7be1951ac359b6fd8db60ca
Now we get the same performance for unpack (with outer_dims_perm) as for unpack + transpose.
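For reference, a rough sketch of the propagation in plain C++; the helper name, containers, and the exact max-propagation rule are illustrative assumptions, not the actual change in this PR:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: for an unpack with an outer_dims_perm, propagate tile sizes through
// the permutation by taking the max of the tile size at each index and the
// tile size at its permuted index, so a permuted dim keeps the larger tile.
std::vector<int64_t>
adjustTileSizesForOuterDimsPerm(const std::vector<int64_t> &tileSizes,
                                const std::vector<int64_t> &outerDimsPerm) {
  std::vector<int64_t> adjusted = tileSizes;
  for (size_t i = 0; i < outerDimsPerm.size(); ++i) {
    size_t permIdx = static_cast<size_t>(outerDimsPerm[i]);
    int64_t maxSize = std::max(tileSizes[i], tileSizes[permIdx]);
    adjusted[i] = maxSize;
    adjusted[permIdx] = maxSize;
  }
  return adjusted;
}
```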
I think this does not actually address the root issue. The unpack from https://gist.github.com/Max191/2d6a74f4f7be1951ac359b6fd8db60ca looks like this:
```mlir
%unpack = tensor.unpack %6 outer_dims_perm = [2, 0, 1] inner_dims_pos = [0, 1] inner_tiles = [16, 16] into %7 : tensor<64x1828x8x16x16xf32> -> tensor<29241x128x64xf32>
```
I believe that the real reason for the poor performance is that the inner dimension of the result gets tiled to 1, and the write accesses must be unrolled. When the transpose is in the same dispatch, the tile sizes get propagated to the unpack, and the inner dimension is tiled to 16.
Consider this similar unpack:
```mlir
%unpack = tensor.unpack %6 outer_dims_perm = [0, 1, 2] inner_dims_pos = [0, 1] inner_tiles = [16, 16] into %7 : tensor<1828x8x64x16x16xf32> -> tensor<29241x128x64xf32>
```
Here `outer_dims_perm` is identity, but the write accesses will still be bad. I think we actually need to be checking the `inner_dims_pos` to make sure the innermost dimension is not getting tiled to 1, and that should properly address the bad performance.
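Something like the following is what I have in mind (a minimal sketch; the helper name, signature, and the precise condition are assumptions, not existing IREE code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of the suggested check: the write accesses get unrolled when the
// innermost destination dim is not one of the inner-tiled dims and its
// vector-level tile size collapses to 1.
bool unpackHasBadInnermostTiling(int64_t destRank,
                                 const std::vector<int64_t> &innerDimsPos,
                                 const std::vector<int64_t> &destTileSizes) {
  int64_t innermost = destRank - 1;
  bool innermostIsInnerTiled =
      std::find(innerDimsPos.begin(), innerDimsPos.end(), innermost) !=
      innerDimsPos.end();
  return !innermostIsInnerTiled && destTileSizes[innermost] == 1;
}
```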
Thanks, @Max191. I will look into this.
I forgot to drop my initial comments. I think having big tile sizes (e.g., 16x16x16) is not a good idea. We need some plan to codegen it properly. @pashu123 is going to start with unpack ukernel enablement and see where the gaps are.
(closing as stale)