
Why does converting to float8e5 in triton cause local memory read/write?

Open MARD1NO opened this issue 1 year ago • 13 comments

I just wrote a kernel that contains an x.to(tl.float8e5), and in ncu I found that it causes local memory reads/stores.

MARD1NO avatar Sep 20 '24 08:09 MARD1NO

can you provide a simple kernel example?

ThomasRaoux avatar Sep 20 '24 14:09 ThomasRaoux

can you provide a simple kernel example?

Sure, I copied the tutorial GEMM kernel as an example and just converted the input matrices to tl.float8e5:

import torch

import triton
import triton.language as tl



@triton.jit
def matmul_kernel(
        # Pointers to matrices
        a_ptr, b_ptr, c_ptr,
        # Matrix dimensions
        M, N, K,
        # The stride variables represent how much to increase the ptr by when moving by 1
        # element in a particular dimension. E.g. `stride_am` is how much to increase `a_ptr`
        # by to get the element one row down (A has M rows).
        stride_am, stride_ak,  #
        stride_bk, stride_bn,  #
        stride_cm, stride_cn,
        # Meta-parameters
        BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,  #
        GROUP_SIZE_M: tl.constexpr,  #
):
    """Kernel for computing the matmul C = A x B.
    A has shape (M, K), B has shape (K, N) and C has shape (M, N)
    """
    # -----------------------------------------------------------
    # Map program ids `pid` to the block of C it should compute.
    # This is done in a grouped ordering to promote L2 data reuse.
    # See above `L2 Cache Optimizations` section for details.
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m

    # ----------------------------------------------------------
    # Create pointers for the first blocks of A and B.
    # We will advance this pointer as we move in the K direction
    # and accumulate
    # `a_ptrs` is a block of [BLOCK_SIZE_M, BLOCK_SIZE_K] pointers
    # `b_ptrs` is a block of [BLOCK_SIZE_K, BLOCK_SIZE_N] pointers
    # See above `Pointer Arithmetic` section for details
    offs_am = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
    offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
    offs_k = tl.arange(0, BLOCK_SIZE_K)
    a_ptrs = a_ptr + (offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak)
    b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn)

    # -----------------------------------------------------------
    # Iterate to compute a block of the C matrix.
    # We accumulate into a `[BLOCK_SIZE_M, BLOCK_SIZE_N]` block
    # of fp32 values for higher accuracy.
    # `accumulator` will be converted back to fp16 after the loop.
    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
        # Load the next block of A and B, generate a mask by checking the K dimension.
        # If it is out of bounds, set it to 0.
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
        
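        # The two casts below are the only change from the tutorial kernel; this
        # fp16 -> float8e5 (e5m2) conversion is what shows up as local memory
        # traffic in ncu.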
        a = a.to(tl.float8e5)
        b = b.to(tl.float8e5)

        # We accumulate along the K dimension.
        accumulator = tl.dot(a, b, accumulator)
        # Advance the ptrs to the next K block.
        a_ptrs += BLOCK_SIZE_K * stride_ak
        b_ptrs += BLOCK_SIZE_K * stride_bk
    # You can fuse arbitrary activation functions here
    # while the accumulator is still in FP32!
    c = accumulator.to(tl.float16)

    # -----------------------------------------------------------
    # Write back the block of the output matrix C with masks.
    offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
    offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    c_ptrs = c_ptr + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]
    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
    tl.store(c_ptrs, c, mask=c_mask)



def matmul(a, b):
    # Check constraints.
    assert a.shape[1] == b.shape[0], "Incompatible dimensions"
    assert a.is_contiguous(), "Matrix A must be contiguous"
    M, K = a.shape
    K, N = b.shape
    # Allocates output.
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    # 1D launch kernel where each block gets its own program.

    BLOCK_SIZE_M = 64
    BLOCK_SIZE_N = 128
    BLOCK_SIZE_K = 64  #
    GROUP_SIZE_M = 8

    # Note: this grid lambda ignores META and uses the fixed block sizes chosen above.
    grid = lambda META: (triton.cdiv(M, BLOCK_SIZE_M) * triton.cdiv(N, BLOCK_SIZE_N), )
    matmul_kernel[grid](
        a, b, c,  #
        M, N, K,  #
        a.stride(0), a.stride(1),  #
        b.stride(0), b.stride(1),  #
        c.stride(0), c.stride(1),  #
        BLOCK_SIZE_M, 
        BLOCK_SIZE_N, 
        BLOCK_SIZE_K, 
        GROUP_SIZE_M
    )
    return c

M = 16
K = 256
N = 256

a = torch.randn((M, K), device='cuda', dtype=torch.float16)
b = torch.randn((K, N), device='cuda', dtype=torch.float16)
# Note: permute(0, 1) on a 2-D tensor is a no-op, so B stays row-major with shape (K, N).
b = b.permute(0, 1).contiguous()
b = b.permute(0, 1)


matmul(a, b)
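As a side note, the PTX can also be checked directly from Python, without Nsight Compute. The snippet below is only a rough sketch: it assumes a recent Triton where launching a jitted kernel returns a compiled-kernel handle exposing an asm dict, and it would replace the launch at the end of the matmul() wrapper above.

    # Sketch: replace the launch inside matmul() with this to capture the
    # compiled-kernel handle and scan its PTX for local-memory declarations,
    # loads, and stores (assumes the launch returns a handle with an asm dict,
    # as in recent Triton releases).
    compiled = matmul_kernel[grid](
        a, b, c,
        M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M,
    )
    ptx = compiled.asm["ptx"]
    local_ops = [line for line in ptx.splitlines() if ".local" in line]
    print("\n".join(local_ops) if local_ops else "no local memory instructions in PTX")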

I profiled it with the Nsight Compute tool; it looks like the datatype conversion is related to the local memory traffic: [screenshots of the Nsight Compute results]

I tested it on an H20.

MARD1NO avatar Sep 21 '24 03:09 MARD1NO

Likely a register spilling problem

Jokeren avatar Sep 21 '24 03:09 Jokeren

Check your register usage of this kernel

Jokeren avatar Sep 21 '24 03:09 Jokeren

Check your register usage of this kernel

I don't think the problem is register spilling; Nsight Compute shows the kernel uses only 168 registers: [screenshot]
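(As a cross-check outside Nsight Compute, the compiled-kernel handle captured in the sketch after the repro script should report the same thing; a rough sketch, assuming recent Triton exposes n_regs and n_spills on that handle:)

    # Sketch (attribute names assumed from recent Triton): reuse the compiled
    # handle from the earlier sketch to print register usage and spill counts.
    print("registers per thread:", compiled.n_regs)
    print("register spills:", compiled.n_spills)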

Maybe the conversion uses some non-constant index into an array?... [screenshot]

MARD1NO avatar Sep 21 '24 03:09 MARD1NO

this PR might fix it. Can you try: https://github.com/triton-lang/triton/pull/4776

ThomasRaoux avatar Sep 21 '24 03:09 ThomasRaoux

this PR might fix it. Can you try: #4776

Thanks Thomas, I will try it :D

MARD1NO avatar Sep 21 '24 03:09 MARD1NO

this PR might fix it. Can you try: #4776

This commit still has local memory writes.

MARD1NO avatar Sep 21 '24 05:09 MARD1NO

It has to do with cvt.rn.satfinite.e5m2x2.f16x2. Taking a look now

Jokeren avatar Sep 21 '24 19:09 Jokeren

It has to do with cvt.rn.satfinite.e5m2x2.f16x2. Taking a look now

Yes, the related SASS and PTX show it uses cvt.rn.satfinite.e5m2x2.f16x2:

[screenshots of the SASS and PTX]

It seems the local store happens right after the conversion to e5m2 in the PTX: [screenshot]
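To isolate the conversion from the GEMM, a cast-only kernel should hit the same fp16 -> e5m2 cvt lowering. The sketch below is hypothetical (not from the original report) and assumes a PyTorch build that has torch.float8_e5m2; since the GEMM casts dot operands, a plain elementwise cast may or may not reproduce the local stores, but it makes the generated cvt easy to inspect.

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def cast_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        # Load fp16 values, cast them to float8e5 (e5m2), and store the fp8 result.
        offs = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        y = x.to(tl.float8e5)
        tl.store(y_ptr + offs, y, mask=mask)


    n = 4096
    x = torch.randn(n, device="cuda", dtype=torch.float16)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e5m2)
    cast_kernel[(triton.cdiv(n, 1024),)](x, y, n, BLOCK=1024)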

MARD1NO avatar Sep 22 '24 02:09 MARD1NO

IMO it's probably caused by NVPTX not handling 8-bit vector types well. Let me start a discussion and get back to you.

Jokeren avatar Sep 22 '24 13:09 Jokeren

Update: @ThomasRaoux has a workaround now, and we will probably land his code after he is back from vacation.

Jokeren avatar Sep 24 '24 13:09 Jokeren

Update: @ThomasRaoux has a workaround now, and we will probably land his code after he is back from vacation.

Thanks!

MARD1NO avatar Sep 24 '24 13:09 MARD1NO

Update: @ThomasRaoux has a workaround now, and we will probably land his code after he is back from vacation.

Hi Keren, has this code landed on the main branch?

MARD1NO avatar Oct 24 '24 01:10 MARD1NO

Probably yes? I'm not 100% sure

Jokeren avatar Oct 24 '24 01:10 Jokeren

It isn't fixed yet, but will be fixed by llvm/llvm-project#113928

peterbell10 avatar Oct 31 '24 18:10 peterbell10

Good to know. Thanks!

Jokeren avatar Oct 31 '24 19:10 Jokeren