Puyan Lotfi
I will take a look, try this patch out and give my review soon. Thanks for looking at this @hyp!
I have some findings on this. **1.** At https://github.com/openai/triton/blob/0327b9d32db6d1d63d207ccab722bd45e00a6678/python/src/llvm.cc#L173 Triton is enabling the SLPVectorizer with an empty target machine, in order to get wider vectors. This results in sequences of...
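For reference, a minimal standalone sketch (not the actual `llvm.cc` code, which may combine this with other passes) of the pattern described above: running the SLPVectorizer through a `PassBuilder` constructed either with a real `TargetMachine` or with none at all. With a null target machine, `TargetIRAnalysis` hands back the generic `TargetTransformInfo`, which is what lets the vectorizer choose wider vectors than the real target would report. Names here are illustrative.

```cpp
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Transforms/Vectorize/SLPVectorizer.h"

// TM == nullptr corresponds to the "empty target machine" case.
static void runSLP(llvm::Module &M, llvm::TargetMachine *TM) {
  llvm::LoopAnalysisManager LAM;
  llvm::FunctionAnalysisManager FAM;
  llvm::CGSCCAnalysisManager CGAM;
  llvm::ModuleAnalysisManager MAM;

  // With a null TargetMachine the registered TargetIRAnalysis produces the
  // generic (target-agnostic) TargetTransformInfo, so the SLP cost queries
  // are not constrained by the real target's vector register width.
  llvm::PassBuilder PB(TM);
  PB.registerModuleAnalyses(MAM);
  PB.registerCGSCCAnalyses(CGAM);
  PB.registerFunctionAnalyses(FAM);
  PB.registerLoopAnalyses(LAM);
  PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  llvm::FunctionPassManager FPM;
  FPM.addPass(llvm::SLPVectorizerPass());

  llvm::ModulePassManager MPM;
  MPM.addPass(llvm::createModuleToFunctionPassAdaptor(std::move(FPM)));
  MPM.run(M, MAM);
}
```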
It seems the change to maxNumImpreciseAcc from https://github.com/openai/triton/pull/2804 brings the run time for matmuls back to 2.2.x levels.
This is just a draft PR; I'm not sure what folks think about optionally enabling mold/lld as a standalone linker (with gcc still as the C/CXX CMake compiler).
@manman-ren @embg
@manman-ren Updated, let me know what you think. Will try and get this running with the OSS benchmark launcher.
> looks good! Thanks! Went ahead and cleaned up the autotuning setup. I also got a test launcher running, but I am not sure if it is doing things correctly.
I started work on this one, some preliminaries are at: https://github.com/plotfi/triton/commit/a9d3ce59cfddc9917438727e4df8969bef46b597 One thing to note is atomicAdd with bfloat16 is only supported on Hopper (sm_90). The cuda library's atomicAdd does...
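To make the arch constraint concrete, here is a rough sketch (not the Triton lowering itself) of how a bfloat16 atomic add typically has to be structured: the native `atomicAdd` overload is only used behind a `__CUDA_ARCH__` guard, and older architectures fall back to a 32-bit `atomicCAS` loop. The helper name `bf16_atomic_add` is hypothetical, and the `>= 900` cutoff just follows the observation above.

```cuda
#include <cuda_bf16.h>
#include <cstdint>

__device__ void bf16_atomic_add(__nv_bfloat16 *addr, __nv_bfloat16 val) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // Hopper: the scalar bf16 atomicAdd overload is available.
  atomicAdd(addr, val);
#else
  // Pre-Hopper fallback: CAS on the aligned 32-bit word that contains the
  // 16-bit value, doing the addition in float.
  uintptr_t raw = reinterpret_cast<uintptr_t>(addr);
  unsigned int *base = reinterpret_cast<unsigned int *>(raw & ~uintptr_t(3));
  unsigned int shift = (raw & 2) ? 16u : 0u;
  unsigned int old = *base, assumed;
  do {
    assumed = old;
    __nv_bfloat16 cur = __ushort_as_bfloat16((assumed >> shift) & 0xffffu);
    unsigned short sum = __bfloat16_as_ushort(
        __float2bfloat16(__bfloat162float(cur) + __bfloat162float(val)));
    unsigned int updated =
        (assumed & ~(0xffffu << shift)) | (static_cast<unsigned int>(sum) << shift);
    old = atomicCAS(base, assumed, updated);
  } while (old != assumed);
#endif
}
```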
Looks like the insert_element in MMA16816SmemLoader::loadX4 is trying to insert at index 32 into a vector that only has 4 elements when lowering the following: ``` %72 = triton_gpu.local_load...
The crash is happening here: https://github.com/triton-lang/triton/blob/main/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp#L415-L420 It is crashing because canonWidth is 32, which goes out of bounds of the retElems SmallVector that contains the 4 elements for the...
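For illustration only, a tiny standalone reduction of the failure mode (same sizes as described above, not the actual Triton code): with 4 lowered elements in `retElems`, any index derived from a `canonWidth` of 32 walks past the end. With an assertions-enabled LLVM build `SmallVector::operator[]` catches this; in a no-assertions build it is a silent out-of-bounds access.

```cpp
#include "llvm/ADT/SmallVector.h"
#include <cassert>

int main() {
  // Stand-in for retElems: the 4 elements produced for the dot operand.
  llvm::SmallVector<int, 4> retElems = {0, 1, 2, 3};
  unsigned canonWidth = 32; // value observed at the crash site
  // The invariant violated by the crash above: indexing retElems at a
  // canonWidth-derived position requires canonWidth < retElems.size().
  bool inBounds = canonWidth < retElems.size();
  assert(inBounds && "canonWidth-derived index out of range for retElems");
  return inBounds ? 0 : 1;
}
```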