Puyan Lotfi
I will take a look, try this patch out and give my review soon. Thanks for looking at this @hyp!
I have some findings on this. **1.** At https://github.com/openai/triton/blob/0327b9d32db6d1d63d207ccab722bd45e00a6678/python/src/llvm.cc#L173 Triton is enabling the SLPVectorizer with an empty target machine, in order to get wider vectors. This results in sequences of...
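For reference, a minimal standalone sketch (not the actual `llvm.cc` code, which may combine this with other passes) of the pattern described above: running the SLPVectorizer through a `PassBuilder` constructed either with a real `TargetMachine` or with none at all. With a null target machine, `TargetIRAnalysis` hands back the generic `TargetTransformInfo`, which is what lets the vectorizer choose wider vectors than the real target would report. Names here are illustrative.

```cpp
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Transforms/Vectorize/SLPVectorizer.h"

// TM == nullptr corresponds to the "empty target machine" case.
static void runSLP(llvm::Module &M, llvm::TargetMachine *TM) {
  llvm::LoopAnalysisManager LAM;
  llvm::FunctionAnalysisManager FAM;
  llvm::CGSCCAnalysisManager CGAM;
  llvm::ModuleAnalysisManager MAM;

  // With a null TargetMachine the registered TargetIRAnalysis produces the
  // generic (target-agnostic) TargetTransformInfo, so the SLP cost queries
  // are not constrained by the real target's vector register width.
  llvm::PassBuilder PB(TM);
  PB.registerModuleAnalyses(MAM);
  PB.registerCGSCCAnalyses(CGAM);
  PB.registerFunctionAnalyses(FAM);
  PB.registerLoopAnalyses(LAM);
  PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  llvm::FunctionPassManager FPM;
  FPM.addPass(llvm::SLPVectorizerPass());

  llvm::ModulePassManager MPM;
  MPM.addPass(llvm::createModuleToFunctionPassAdaptor(std::move(FPM)));
  MPM.run(M, MAM);
}
```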
It seems the change to maxNumImpreciseAcc from https://github.com/openai/triton/pull/2804 brings the run time for matmuls back to 2.2.x levels.
This is just a draft PR; I'm not sure what folks think about optionally enabling mold/lld as a standalone linker (with gcc still as the C/CXX CMake compiler).
@manman-ren @embg
@manman-ren Updated, let me know what you think. Will try and get this running with the OSS benchmark launcher.
> looks good! Thanks! Went ahead and cleaned up the autotuning setup. I also got a test launcher running, but I am not sure if it is doing things correctly.
I started work on this one, some preliminaries are at: https://github.com/plotfi/triton/commit/a9d3ce59cfddc9917438727e4df8969bef46b597 One thing to note is atomicAdd with bfloat16 is only supported on Hopper (sm_90). The cuda library's atomicAdd does...
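To make the arch constraint concrete, here is a rough sketch (not the Triton lowering itself) of how a bfloat16 atomic add typically has to be structured: the native `atomicAdd` overload is only used behind a `__CUDA_ARCH__` guard, and older architectures fall back to a 32-bit `atomicCAS` loop. The helper name `bf16_atomic_add` is hypothetical, and the `>= 900` cutoff just follows the observation above.

```cuda
#include <cuda_bf16.h>
#include <cstdint>

__device__ void bf16_atomic_add(__nv_bfloat16 *addr, __nv_bfloat16 val) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // Hopper: the scalar bf16 atomicAdd overload is available.
  atomicAdd(addr, val);
#else
  // Pre-Hopper fallback: CAS on the aligned 32-bit word that contains the
  // 16-bit value, doing the addition in float.
  uintptr_t raw = reinterpret_cast<uintptr_t>(addr);
  unsigned int *base = reinterpret_cast<unsigned int *>(raw & ~uintptr_t(3));
  unsigned int shift = (raw & 2) ? 16u : 0u;
  unsigned int old = *base, assumed;
  do {
    assumed = old;
    __nv_bfloat16 cur = __ushort_as_bfloat16((assumed >> shift) & 0xffffu);
    unsigned short sum = __bfloat16_as_ushort(
        __float2bfloat16(__bfloat162float(cur) + __bfloat162float(val)));
    unsigned int updated =
        (assumed & ~(0xffffu << shift)) | (static_cast<unsigned int>(sum) << shift);
    old = atomicCAS(base, assumed, updated);
  } while (old != assumed);
#endif
}
```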
Looks like the insert_element in MMA16816SmemLoader::loadX4 is trying to insert at index 32 into a vector that only has 4 elements when lowering the following: ``` %72 = triton_gpu.local_load...
The crash is happening here: https://github.com/triton-lang/triton/blob/main/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp#L415-L420 It is crashing because canonWidth is 32, which goes out of bounds of the retElems SmallVector that contains the 4 elements for the...
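For illustration only, a tiny standalone reduction of the failure mode (same sizes as described above, not the actual Triton code): with 4 lowered elements in `retElems`, any index derived from a `canonWidth` of 32 walks past the end. With an assertions-enabled LLVM build `SmallVector::operator[]` catches this; in a no-assertions build it is a silent out-of-bounds access.

```cpp
#include "llvm/ADT/SmallVector.h"
#include <cassert>

int main() {
  // Stand-in for retElems: the 4 elements produced for the dot operand.
  llvm::SmallVector<int, 4> retElems = {0, 1, 2, 3};
  unsigned canonWidth = 32; // value observed at the crash site
  // The invariant violated by the crash above: indexing retElems at a
  // canonWidth-derived position requires canonWidth < retElems.size().
  bool inBounds = canonWidth < retElems.size();
  assert(inBounds && "canonWidth-derived index out of range for retElems");
  return inBounds ? 0 : 1;
}
```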