[LLVMCPU] Adjust tile sizes of unpack op
Adjust the vector-level tile sizes of unpack ops that have an outer dims permutation. For each dimension, the tile size is propagated through the permutation, taking the max of the tile size at the original index and at the permuted index.
The hint was taken from https://gist.github.com/Max191/2d6a74f4f7be1951ac359b6fd8db60ca
Now we get the same performance for unpack (with outer_dims_perm) as for unpack + transpose.
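For reference, a rough sketch of the propagation in plain C++; the helper name, containers, and the exact max-propagation rule are illustrative assumptions, not the actual change in this PR:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: for an unpack with an outer_dims_perm, propagate tile sizes through
// the permutation by taking the max of the tile size at each index and the
// tile size at its permuted index, so a permuted dim keeps the larger tile.
std::vector<int64_t>
adjustTileSizesForOuterDimsPerm(const std::vector<int64_t> &tileSizes,
                                const std::vector<int64_t> &outerDimsPerm) {
  std::vector<int64_t> adjusted = tileSizes;
  for (size_t i = 0; i < outerDimsPerm.size(); ++i) {
    size_t permIdx = static_cast<size_t>(outerDimsPerm[i]);
    int64_t maxSize = std::max(tileSizes[i], tileSizes[permIdx]);
    adjusted[i] = maxSize;
    adjusted[permIdx] = maxSize;
  }
  return adjusted;
}
```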
I think this does not actually address the root issue. The unpack from https://gist.github.com/Max191/2d6a74f4f7be1951ac359b6fd8db60ca looks like this:
```mlir
%unpack = tensor.unpack %6 outer_dims_perm = [2, 0, 1] inner_dims_pos = [0, 1] inner_tiles = [16, 16] into %7 : tensor<64x1828x8x16x16xf32> -> tensor<29241x128x64xf32>
```
I believe that the real reason for the poor performance is that the inner dimension of the result gets tiled to 1, and the write accesses must be unrolled. When the transpose is in the same dispatch, the tile sizes get propagated to the unpack, and the inner dimension is tiled to 16.
Consider this similar unpack:
```mlir
%unpack = tensor.unpack %6 outer_dims_perm = [0, 1, 2] inner_dims_pos = [0, 1] inner_tiles = [16, 16] into %7 : tensor<1828x8x64x16x16xf32> -> tensor<29241x128x64xf32>
```
Here `outer_dims_perm` is identity, but the write accesses will still be bad. I think we actually need to be checking the `inner_dims_pos` to make sure the innermost dimension is not getting tiled to 1, and that should properly address the bad performance.
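Something like the following is what I have in mind (a minimal sketch; the helper name, signature, and the precise condition are assumptions, not existing IREE code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of the suggested check: the write accesses get unrolled when the
// innermost destination dim is not one of the inner-tiled dims and its
// vector-level tile size collapses to 1.
bool unpackHasBadInnermostTiling(int64_t destRank,
                                 const std::vector<int64_t> &innerDimsPos,
                                 const std::vector<int64_t> &destTileSizes) {
  int64_t innermost = destRank - 1;
  bool innermostIsInnerTiled =
      std::find(innerDimsPos.begin(), innerDimsPos.end(), innermost) !=
      innerDimsPos.end();
  return !innermostIsInnerTiled && destTileSizes[innermost] == 1;
}
```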
Thanks, @Max191. I will look into this.
I forgot to drop my initial comments. I think having big tile sizes (e.g., 16x16x16) is not a good idea. We need some plan to codegen it properly. @pashu123 is going to start with unpack ukernel enablement and see where the gaps are.
(closing as stale)