Yuan Tong comments

Results: 22 comments by Yuan Tong

Add support for sharp yuv

I did some work on porting sharp yuv to libavif (https://github.com/AOMediaCodec/libavif/pull/444), and I'd like to share some information here.

> this could be done with a large matrix inversion

Unfortunately...

Major improvement areas and future plans

What do you think about conda for packaging? It knows how to handle Python packages and binary packages (so Script X can list Plugin Y, as well as normal python/binary...

Major improvement areas and future plans

> ana(conda) took 5 minutes to install on a ssd. Well, the install size is over 10GB.... a bit heavy for only installing a simple dll. `conda` itself is notoriously...

Major improvement areas and future plans

> This works much faster and better, thx. But why is x264, fonts, etc a dependency of vapoursynth-bestsource? Or am I interpreting this wrong?

Bestsource is dynamically linked to the...

[BACKEND] Add support to convert INT8 MMAV2 accumulator layout to dot_operand layout

> I don't understand, could you give an example of what MMA format is different based on the type?

You're indeed right. I was confused as there turns out to...

[BACKEND] Add support to convert INT8 MMAV2 accumulator layout to dot_operand layout

> FAILED hopper/test_gemm.py::test_gemm[128-128-64-4-1-4096-1-1024-False-False-True-none-float32-False-3] - AssertionError: Tensor-likes are not close!
>
> Mismatched elements: 1 / 4096 (0.0%)
> Greatest absolute difference: 2.0 at index (289, 0) (up to 0.001 allowed)...

[BACKEND] Add support to convert INT8 MMAV2 accumulator layout to dot_operand layout

> I see a big performance regression in the fp8 flash attention tutorial with this patch

Sorry, I don't have access to an H100. I made an attempt to fix it,...

Chained INT8 dot produces incorrect result

I tried disabling the `ConvertLayoutOpConversion::lowerMmaToDotOperand` pass in `third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/ConvertLayoutOpToLLVM.cpp`, and the result is correct. Disabling this conversion for INT8 input fixes the two `USE_FP_MATMUL2=False` cases, and disabling it completely fixes `LOAD_V_TRANSPOSED=False...

Chained INT8 dot produces incorrect result

I managed to modify `ConvertLayoutOpConversion::convertMMAV3To8BitsDotOperand` slightly so that the `USE_FP_MATMUL2=False` case produces the correct result. But how can I determine whether the source matrix is produced by an INT8 MMA (may be...

Chained INT8 dot produces incorrect result

`LOAD_V_TRANSPOSED=False USE_FP_MATMUL2=True` seems to be another issue and can be reproduced even when BOTH matmuls are FP16:

```python3
import numpy as np
import torch
import triton
import triton.language as tl

@triton.jit
...
```