peterbell10 comments


Results: 134 comments of peterbell10

Why change the order of make_block_ptr when V.dtype.element_ty == tl.float8e5?

The `v` matrix is used as the `b` argument to a wgmma instruction. wgmma allows fp16 inputs to be in either row-major or column-major format, but for FP8 types...
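
As a loose illustration of the row-major versus column-major distinction mentioned here, the NumPy sketch below stores the same logical matrix in both layouts; only the strides differ. This is a host-side analogy, not Triton or wgmma code, and the array `v` and its shape are made up for the example.

```python
import numpy as np

# Hypothetical 2x3 fp16 matrix, stored row-major (C order) by default.
v = np.arange(6, dtype=np.float16).reshape(2, 3)

# Same values, but laid out column-major (Fortran order) in memory.
v_col = np.asfortranarray(v)

# Strides are in bytes; fp16 elements are 2 bytes each.
row_strides = v.strides      # step of 6 bytes between rows, 2 between columns
col_strides = v_col.strides  # step of 2 bytes between rows, 4 between columns

# The logical contents are identical; only the memory layout changed.
same_values = np.array_equal(v, v_col)
```

An instruction that accepts only one layout would need the operand prepared in that order, which is one plausible reason for the order passed to `make_block_ptr` to depend on the dtype.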

Racecheck Bug when tl.min used with tl.sum

Out of curiosity, I profiled the repro before and after the change, and I do see a small (~1%) speedup that reproduces consistently.

Index in triton

#5262 adds support for this as

```python
output = tl.gather(x, idx, axis=0)
```
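
For readers unfamiliar with gather semantics, here is a minimal NumPy sketch of what such a call presumably computes. It assumes `tl.gather` follows the same convention as `torch.gather`, so that for `axis=0`, `output[i, j] = x[idx[i, j], j]`. The helper name `gather_reference` is hypothetical, and the code runs on the host rather than inside a Triton kernel.

```python
import numpy as np

def gather_reference(x, idx, axis=0):
    # Hypothetical host-side reference, not the Triton API:
    # picks elements of x along `axis` at the positions given by idx.
    return np.take_along_axis(x, idx, axis=axis)

x = np.array([[10, 11],
              [20, 21],
              [30, 31]])
idx = np.array([[2, 0],
                [0, 1]])

# For axis=0: out[i, j] = x[idx[i, j], j]
out = gather_reference(x, idx, axis=0)
```

A reference like this can be handy for checking a kernel's output against expected values on small inputs.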

Build fails for Grace Hopper system

`std::reduce` requires C++17. Perhaps your toolchain is too old or is being called with the wrong flags?