Hongtao Yu

53 comments by Hongtao Yu

It looks like the predication is different in the two cases. Disable opt:

```
mov.u32 %r1, %ctaid.x;
shl.b32 %r9, %r1, 8;
or.b32 %r10, %r9, %r8;
shr.s32 %r11, %r10, 1;
setp.eq.s32...
```

> Yes, but the problem is that r10 is supposed to be a multiple of 2, so checking it against 0 vs. checking (r10 >> 1) against 0 should be...
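
The quoted comment above is cut off, so its conclusion is not visible here. Assuming it argues that the two checks are interchangeable when `r10` is even, a minimal Python sketch of that reasoning follows. The helper names (`to_signed32`, `shr_s32`, `checks_agree`) are hypothetical and only model a signed 32-bit register and an arithmetic right shift in the style of `shr.s32`; they are not taken from the original thread.

```python
def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def shr_s32(x: int, n: int) -> int:
    """Arithmetic right shift of a signed 32-bit value (models shr.s32)."""
    # Python's >> on negative ints is already an arithmetic shift.
    return to_signed32(x) >> n


def checks_agree(r10: int) -> bool:
    """True if (r10 == 0) and ((r10 >> 1) == 0) give the same answer."""
    return (to_signed32(r10) == 0) == (shr_s32(r10, 1) == 0)


# For even values the two comparisons agree, including negative ones.
even_samples = [0, 2, 4, 256, -2, -256, 0x7FFFFFFE, -0x80000000]
assert all(checks_agree(v) for v in even_samples)

# For an odd value such as 1 they differ: 1 != 0, but 1 >> 1 == 0.
assert not checks_agree(1)
```

Under this model the shift only discards the lowest bit, so the two comparisons can differ only when that bit is set and every other bit is clear, that is, when `r10` is 1.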

> Disabling optimization for ptxas also fixed the problem. [def-sass.txt](https://github.com/openai/triton/files/14030529/def-sass.txt) [no-ptxas-opt-sass.txt](https://github.com/openai/triton/files/14030530/no-ptxas-opt-sass.txt)
>
> Patch to enable debugging: #2995

Thanks for giving it a shot. So the problem went away with...

I'm looking at this issue. https://github.com/pytorch/pytorch/issues/122227 sounds similar.

Thanks for working on this. LGTM. I'm also curious about how this moves perf. Maybe kick off a PyTorch nightly perf run?

> What is the easiest way to kick off pytorch perf test? Is it changing the pytorch pin to this hash, then start some job from the diff? @htyu

Yes....

Have you given torch 2.4 a shot? It doesn't repro for me with that.

> @htyu you're testing on an sm75 GPU without tensor cores?

I see. I'm not. Let me get a machine without tensor cores.

Just a heads-up. It looks like a P100 machine is needed. We are running short on those internally and we had toolchain issues (e.g., glibc) required by the latest LLVM on...