peterbell10
The `v` matrix is used as the `b` argument to a wgmma instruction. wgmma allows fp16 inputs to be in either row-major or column-major format, but for FP8 types...
Out of curiosity, I profiled the repro before and after the change, and I do see a small (~1%) speedup that reproduces consistently.
#5262 adds support for this as:

```python
output = tl.gather(x, idx, axis=0)
```
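For anyone unsure what a gather along `axis=0` computes, here is a minimal pure-Python sketch. It assumes `tl.gather` follows the same take-along-axis semantics as `torch.gather`, i.e. `out[i][j] = x[idx[i][j]][j]`. The helper name `gather_axis0` is hypothetical and is not part of Triton; this is only a reference for the indexing, not how the kernel is implemented.

```python
def gather_axis0(x, idx):
    """Reference gather along axis 0 for 2-D nested lists.

    Assumed semantics (as in torch.gather): out[i][j] = x[idx[i][j]][j].
    """
    n_rows = len(x)
    out = []
    for idx_row in idx:
        out_row = []
        for j, r in enumerate(idx_row):
            # Each index selects a row of x; the column j stays fixed.
            if not 0 <= r < n_rows:
                raise IndexError(f"index {r} out of range for axis 0 of size {n_rows}")
            out_row.append(x[r][j])
        out.append(out_row)
    return out


x = [[10, 11],
     [20, 21],
     [30, 31]]
idx = [[2, 0],
       [1, 1]]

out = gather_axis0(x, idx)
print(out)  # [[30, 11], [20, 21]]
```

The output has the shape of `idx`, and each column of `idx` only picks values from the matching column of `x`.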
`std::reduce` requires C++17. Perhaps your toolchain is too old or is being called with the wrong flags?