Hongtao Yu comments

Results 53 comments of Hongtao Yu

Custom tensor roll kernel produces wrong results for variations of block size and tl.constexpr

It looks like the predication is different in the two cases. Disable opt:

```
mov.u32 %r1, %ctaid.x;
shl.b32 %r9, %r1, 8;
or.b32 %r10, %r9, %r8;
shr.s32 %r11, %r10, 1;
setp.eq.s32...
```

Custom tensor roll kernel produces wrong results for variations of block size and tl.constexpr

> Yes, but the problem is that r10 is supposed to be a multiple of 2, so checking it against 0 vs. checking (r10 >> 1) against 0 should be...
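The parity argument in the quote above can be checked directly. For an even integer, comparing the value against 0 and comparing its arithmetic right shift by one against 0 give the same answer; for an odd value they can differ. The following is a minimal Python sketch of that check only, not of the kernel or of the ptxas transformation. The helper name `checks_agree` and the tested range are hypothetical, chosen for illustration.

```python
def checks_agree(x: int) -> bool:
    """Return True when (x == 0) and ((x >> 1) == 0) give the same result.

    Python's >> on ints is an arithmetic (sign-preserving) shift, which
    matches the behaviour of a signed shift such as shr.s32 for values
    that fit in 32 bits.
    """
    return (x == 0) == ((x >> 1) == 0)


# Every even value in a small signed range: the two checks always agree.
even_values = range(-1024, 1025, 2)
all_even_agree = all(checks_agree(x) for x in even_values)

# An odd value breaks the equivalence: 1 != 0, but (1 >> 1) == 0.
odd_case_agrees = checks_agree(1)

print(all_even_agree, odd_case_agrees)
```

So the two predicates are interchangeable only while the "multiple of 2" assumption actually holds for the shifted register.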

Custom tensor roll kernel produces wrong results for variations of block size and tl.constexpr

> Disabling optimization for ptxas also fixed the problem. [def-sass.txt](https://github.com/openai/triton/files/14030529/def-sass.txt) [no-ptxas-opt-sass.txt](https://github.com/openai/triton/files/14030530/no-ptxas-opt-sass.txt)
>
> Patch to enable debugging: #2995

Thanks for giving it a shot. So the problem went away with...

Segmentation Fault on Mixed Precision Matmul int8xf32

I'm looking at this issue. https://github.com/pytorch/pytorch/issues/122227 sounds similar.

Support L2 cache hint

Thanks for working on this. LGTM. I'm also curious about how this moves perf. Maybe kick off a PyTorch nightly perf run?

Support L2 cache hint

> What is the easiest way to kick off pytorch perf test? Is it changing the pytorch pin to this hash, then start some job from the diff? @htyu

Yes....

IndexError: map::at when doing torch.ops.matmul on fp32 matrices

I can take a look.

IndexError: map::at when doing torch.ops.matmul on fp32 matrices

Have you given torch 2.4 a shot? It doesn't repro for me with that.

IndexError: map::at when doing torch.ops.matmul on fp32 matrices

> @htyu you're testing on an sm75 GPU without tensor cores?

I see. I'm not. Let me get a machine without tensor cores.

IndexError: map::at when doing torch.ops.matmul on fp32 matrices

Just a heads-up. It looks like a P100 machine is needed. We are running short on those internally and we had toolchain issues (e.g., glibc) required by the latest LLVM on...