sebvince
Awesome! We should catch up to discuss this :) Happy to look at an ATT trace if you have one. For `aux = 3` -> it sets sc0 sc1...
> As a reference, I managed to run the same shape with the assembly kernel using https://github.com/nod-ai/fp4-benchmark
>
> Assembly kernel e2e latency: **3.61ms**
>
> If we ignore all...
That makes more sense. Thanks for the clarification @Yu-Zhewen @Muzammiluddin-Syed-ECE !
Looking at the IR, it seems that the LDS tile is 32x128.
```
%14 = iree_codegen.swizzle_hint %alloc[#iree_codegen.xor_shuffle] : memref
%expand_shape_0 = memref.expand_shape %14 [[0, 1, 2]] output_shape [32, 128, 32]...
```