
Problem with the convolution example

Open jiangwenj02 opened this issue 9 months ago • 19 comments

Hi, I ran the convolution example on a Titan RTX, but I hit the following issue. How can I address it?

error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (__nv_bfloat162 *, __nv_bfloat162)
      atomicAdd(

Thank you.

jiangwenj02 avatar Mar 25 '25 09:03 jiangwenj02

I'm surprised by this error message, since the convolution example doesn't use atomicAdd. Which tilelang version are you working on?

LeiWang1999 avatar Mar 25 '25 10:03 LeiWang1999

The tilelang version is 0.1.3.

The following are the errors:

/root/anaconda3/envs/openseed/lib/python3.10/site-packages/tilelang/src/tl_templates/cuda/common.h(122): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (__nv_bfloat16 *, __nv_bfloat16)
      atomicAdd(reinterpret_cast<__nv_bfloat16 *>(address),
      ^

/root/anaconda3/envs/openseed/lib/python3.10/site-packages/tilelang/src/tl_templates/cuda/common.h(128): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (__nv_bfloat16 *, __nv_bfloat16)
      atomicAdd(reinterpret_cast<__nv_bfloat16 *>(address), __float2bfloat16(val));
      ^

/root/anaconda3/envs/openseed/lib/python3.10/site-packages/tilelang/src/tl_templates/cuda/common.h(133): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (__nv_bfloat16 *, __nv_bfloat16)
      atomicAdd(reinterpret_cast<__nv_bfloat16 *>(address),
      ^

/root/anaconda3/envs/openseed/lib/python3.10/site-packages/tilelang/src/tl_templates/cuda/common.h(145): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (__nv_bfloat162 *, __nv_bfloat162)
      atomicAdd(
      ^

The following is the generated CUDA code:

#include <tl_templates/cuda/gemm.h>
#include <tl_templates/cuda/copy.h>
#include <tl_templates/cuda/reduce.h>
#include <tl_templates/cuda/ldsm.h>
#include <tl_templates/cuda/threadblock_swizzle.h>
#include <tl_templates/cuda/debug.h>

extern "C" global void main_kernel(half_t* restrict data, half_t* restrict kernel_flat, half_t* restrict out_flat); extern "C" global void launch_bounds(256, 1) main_kernel(half_t* restrict data, half_t* restrict kernel_flat, half_t* restrict out_flat) { extern shared align(1024) uchar buf_dyn_shmem[]; float out_local[128]; #pragma unroll for (int i = 0; i < 64; ++i) { (float2)(out_local + (i * 2)) = make_float2(0.000000e+00f, 0.000000e+00f); } #pragma unroll for (int i_1 = 0; i_1 < 64; ++i_1) { half_t condval; if (((1 <= (((((int)blockIdx.y) & 15) * 4) + (i_1 >> 4))) && (1 <= ((((i_1 & 15) >> 3) * 32) + (((int)threadIdx.x) >> 3))))) { condval = data[((((((((int)blockIdx.y) * 32768) + ((i_1 >> 3) * 4096)) + ((((int)threadIdx.x) >> 3) * 128)) + ((((int)threadIdx.x) & 7) * 8)) + (i_1 & 7)) - 8320)]; } else { condval = half_t(0.000000e+00f); } ((half_t*)buf_dyn_shmem)[((((((((i_1 >> 3) * 2048) + ((((int)threadIdx.x) >> 3) * 64)) + (((((((int)threadIdx.x) & 63) >> 5) + ((((int)threadIdx.x) & 7) >> 2)) & 1) * 32)) + (((((((int)threadIdx.x) & 31) >> 4) + ((((int)threadIdx.x) & 3) >> 1)) & 1) * 16)) + (((((((int)threadIdx.x) & 15) >> 3) + (((int)threadIdx.x) & 1)) & 1) * 8)) + (i_1 & 7)) + 32768)] = condval; } #pragma unroll for (int i_2 = 0; i_2 < 4; ++i_2) { (uint4)(((half_t*)buf_dyn_shmem) + ((((((((((int)threadIdx.x) & 15) >> 3) * 4096) + (i_2 * 1024)) + ((((int)threadIdx.x) >> 4) * 64)) + (((((((int)threadIdx.x) & 127) >> 6) + ((((int)threadIdx.x) & 7) >> 2)) & 1) * 32)) + (((((((int)threadIdx.x) & 63) >> 5) + ((((int)threadIdx.x) & 3) >> 1)) & 1) * 16)) + (((((((int)threadIdx.x) & 31) >> 4) + (((int)threadIdx.x) & 1)) & 1) * 8))) = (uint4)(kernel_flat + ((i_2 * 2048) + (((int)threadIdx.x) * 8))); } #pragma unroll for (int i_3 = 0; i_3 < 64; ++i_3) { half_t condval_1; if (((1 <= (((((int)blockIdx.y) & 15) * 4) + (i_3 >> 4))) && (1 <= ((((i_3 & 15) >> 3) * 32) + (((int)threadIdx.x) >> 3))))) { condval_1 = data[((((((((int)blockIdx.y) * 32768) + ((i_3 >> 3) * 4096)) + ((((int)threadIdx.x) >> 3) * 128)) + ((((int)threadIdx.x) & 7) * 8)) + (i_3 & 7)) - 8256)]; } else { condval_1 = half_t(0.000000e+00f); } ((half_t*)buf_dyn_shmem)[((((((((i_3 >> 3) * 2048) + ((((int)threadIdx.x) >> 3) * 64)) + (((((((int)threadIdx.x) & 63) >> 5) + ((((int)threadIdx.x) & 7) >> 2)) & 1) * 32)) + (((((((int)threadIdx.x) & 31) >> 4) + ((((int)threadIdx.x) & 3) >> 1)) & 1) * 16)) + (((((((int)threadIdx.x) & 15) >> 3) + (((int)threadIdx.x) & 1)) & 1) * 8)) + (i_3 & 7)) + 49152)] = condval_1; } #pragma unroll for (int i_4 = 0; i_4 < 4; ++i_4) { (uint4)(((half_t*)buf_dyn_shmem) + (((((((((((int)threadIdx.x) & 15) >> 3) * 4096) + (i_4 * 1024)) + ((((int)threadIdx.x) >> 4) * 64)) + (((((((int)threadIdx.x) & 127) >> 6) + ((((int)threadIdx.x) & 7) >> 2)) & 1) * 32)) + (((((((int)threadIdx.x) & 63) >> 5) + ((((int)threadIdx.x) & 3) >> 1)) & 1) * 16)) + (((((((int)threadIdx.x) & 31) >> 4) + (((int)threadIdx.x) & 1)) & 1) * 8)) + 8192)) = (uint4)(kernel_flat + (((i_4 * 2048) + (((int)threadIdx.x) * 8)) + 8192)); } #pragma unroll for (int i_5 = 0; i_5 < 64; ++i_5) { half_t condval_2; if ((1 <= (((((int)blockIdx.y) & 15) * 4) + (i_5 >> 4)))) { condval_2 = data[((((((((int)blockIdx.y) * 32768) + ((i_5 >> 3) * 4096)) + ((((int)threadIdx.x) >> 3) * 128)) + ((((int)threadIdx.x) & 7) * 8)) + (i_5 & 7)) - 8192)]; } else { condval_2 = half_t(0.000000e+00f); } ((half_t*)buf_dyn_shmem)[((((((((i_5 >> 3) * 2048) + ((((int)threadIdx.x) >> 3) * 64)) + (((((((int)threadIdx.x) & 63) >> 5) + ((((int)threadIdx.x) & 7) >> 2)) & 1) * 32)) + 
(((((((int)threadIdx.x) & 31) >> 4) + ((((int)threadIdx.x) & 3) >> 1)) & 1) * 16)) + (((((((int)threadIdx.x) & 15) >> 3) + (((int)threadIdx.x) & 1)) & 1) * 8)) + (i_5 & 7)) + 65536)] = condval_2; } #pragma unroll for (int i_6 = 0; i_6 < 4; ++i_6) { (uint4)(((half_t*)buf_dyn_shmem) + (((((((((((int)threadIdx.x) & 15) >> 3) * 4096) + (i_6 * 1024)) + ((((int)threadIdx.x) >> 4) * 64)) + (((((((int)threadIdx.x) & 127) >> 6) + ((((int)threadIdx.x) & 7) >> 2)) & 1) * 32)) + (((((((int)threadIdx.x) & 63) >> 5) + ((((int)threadIdx.x) & 3) >> 1)) & 1) * 16)) + (((((((int)threadIdx.x) & 31) >> 4) + (((int)threadIdx.x) & 1)) & 1) * 8)) + 16384)) = (uint4)(kernel_flat + (((i_6 * 2048) + (((int)threadIdx.x) * 8)) + 16384)); } for (int k_iter = 0; k_iter < 15; ++k_iter) { __syncthreads(); #pragma unroll for (int i_7 = 0; i_7 < 64; ++i_7) { half_t condval_3; if (((((((((((i_7 & 15) >> 3) * 32) + (((int)threadIdx.x) >> 3)) + (((k_iter + 3) % 6) >> 1)) < 65) && (((((((int)blockIdx.y) & 15) * 4) + ((k_iter + 3) / 6)) + (i_7 >> 4)) < 65)) && (1 <= ((((((int)blockIdx.y) & 15) * 4) + ((k_iter + 3) / 6)) + (i_7 >> 4)))) && (1 <= (((((i_7 & 15) >> 3) * 32) + (((int)threadIdx.x) >> 3)) + (((k_iter + 3) % 6) >> 1)))) && (((((((int)blockIdx.y) & 15) * 4) + ((k_iter + 3) / 6)) + (i_7 >> 4)) < 65))) { condval_3 = data[(((((((((((int)blockIdx.y) * 32768) + (((k_iter + 3) / 6) * 8192)) + ((i_7 >> 3) * 4096)) + ((((int)threadIdx.x) >> 3) * 128)) + ((((k_iter + 3) % 6) >> 1) * 128)) + (((k_iter + 1) & 1) * 64)) + ((((int)threadIdx.x) & 7) * 8)) + (i_7 & 7)) - 8320)]; } else { condval_3 = half_t(0.000000e+00f); } ((half_t*)buf_dyn_shmem)[((((((((((k_iter + 3) & 3) * 16384) + ((i_7 >> 3) * 2048)) + ((((int)threadIdx.x) >> 3) * 64)) + (((((((int)threadIdx.x) & 63) >> 5) + ((((int)threadIdx.x) & 7) >> 2)) & 1) * 32)) + (((((((int)threadIdx.x) & 31) >> 4) + ((((int)threadIdx.x) & 3) >> 1)) & 1) * 16)) + (((((((int)threadIdx.x) & 15) >> 3) + (((int)threadIdx.x) & 1)) & 1) * 8)) + (i_7 & 7)) + 32768)] = condval_3; } #pragma unroll for (int i_8 = 0; i_8 < 4; ++i_8) { (uint4)(((half_t*)buf_dyn_shmem) + (((((((((k_iter + 3) & 3) * 8192) + (((((int)threadIdx.x) & 15) >> 3) * 4096)) + (i_8 * 1024)) + ((((int)threadIdx.x) >> 4) * 64)) + (((((((int)threadIdx.x) & 127) >> 6) + ((((int)threadIdx.x) & 7) >> 2)) & 1) * 32)) + (((((((int)threadIdx.x) & 63) >> 5) + ((((int)threadIdx.x) & 3) >> 1)) & 1) * 16)) + (((((((int)threadIdx.x) & 31) >> 4) + (((int)threadIdx.x) & 1)) & 1) * 8))) = (uint4)(kernel_flat + ((((k_iter * 8192) + (i_8 * 2048)) + (((int)threadIdx.x) * 8)) + 24576)); } __syncthreads(); tl::gemm_ss<256, 128, 64, 4, 2, 0, 0, 0>((&(((half_t*)buf_dyn_shmem)[(((k_iter & 3) * 16384) + 32768)])), (&(((half_t*)buf_dyn_shmem)[((k_iter & 3) * 8192)])), (&(out_local[0]))); } tl::gemm_ss<256, 128, 64, 4, 2, 0, 0, 0>((&(((half_t*)buf_dyn_shmem)[81920])), (&(((half_t*)buf_dyn_shmem)[24576])), (&(out_local[0]))); tl::gemm_ss<256, 128, 64, 4, 2, 0, 0, 0>((&(((half_t*)buf_dyn_shmem)[32768])), (&(((half_t*)buf_dyn_shmem)[0])), (&(out_local[0]))); tl::gemm_ss<256, 128, 64, 4, 2, 0, 0, 0>((&(((half_t*)buf_dyn_shmem)[49152])), (&(((half_t*)buf_dyn_shmem)[8192])), (&(out_local[0]))); #pragma unroll for (int i_9 = 0; i_9 < 64; ++i_9) { for (int vec = 0; vec < 2; ++vec) { ((half_t*)buf_dyn_shmem)[((((((((((((((((i_9 >> 3) * 16) + ((((int)threadIdx.x) >> 7) * 8)) + ((((int)threadIdx.x) & 3) * 2)) + vec) >> 6) * 16384) + (((i_9 & 7) >> 1) * 4096)) + (((((int)threadIdx.x) & 127) >> 5) * 1024)) + ((i_9 & 1) * 512)) + (((((int)threadIdx.x) & 31) >> 2) * 
64)) + ((((((((((i_9 >> 3) * 16) + ((((int)threadIdx.x) >> 7) * 8)) + ((((int)threadIdx.x) & 3) * 2)) + vec) & 63) >> 5) + ((((int)threadIdx.x) & 31) >> 4)) & 1) * 32)) + (((((i_9 & 15) >> 3) + ((((int)threadIdx.x) & 15) >> 3)) & 1) * 16)) + ((((((int)threadIdx.x) >> 7) + ((((int)threadIdx.x) & 7) >> 2)) & 1) * 8)) + ((((int)threadIdx.x) & 3) * 2)) + vec)] = ((half_t)out_local[((i_9 * 2) + vec)]); } } __syncthreads(); #pragma unroll for (int i_10 = 0; i_10 < 16; ++i_10) { (uint4)(out_flat + (((((int)blockIdx.y) * 32768) + (i_10 * 2048)) + (((int)threadIdx.x) * 8))) = (uint4)(((half_t*)buf_dyn_shmem) + ((((((((((int)threadIdx.x) & 15) >> 3) * 16384) + (i_10 * 1024)) + ((((int)threadIdx.x) >> 4) * 64)) + (((((((int)threadIdx.x) & 127) >> 6) + ((((int)threadIdx.x) & 7) >> 2)) & 1) * 32)) + (((((((int)threadIdx.x) & 63) >> 5) + ((((int)threadIdx.x) & 3) >> 1)) & 1) * 16)) + (((((((int)threadIdx.x) & 31) >> 4) + (((int)threadIdx.x) & 1)) & 1) * 8))); } }

#define ERROR_BUF_SIZE 1024
static char error_buf[ERROR_BUF_SIZE];

extern "C" const char* get_last_error() { return error_buf; }

extern "C" int init() { error_buf[0] = '\0';

cudaError_t result = cudaFuncSetAttribute(main_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 196608);
if (result != CUDA_SUCCESS) {
    snprintf(error_buf, ERROR_BUF_SIZE, "Failed to set the allowed dynamic shared memory size to %d with error: %s", 196608, cudaGetErrorString(result));
    return -1;
}

return 0;

}

extern "C" int call(half_t* restrict data, half_t* restrict kernel_flat, half_t* restrict out_flat, cudaStream_t stream=cudaStreamDefault) { main_kernel<<<dim3(1, 2048, 1), dim3(256, 1, 1), 196608, stream>>>(data, kernel_flat, out_flat);

return 0; }

jiangwenj02 avatar Mar 26 '25 05:03 jiangwenj02

@jiangwenj02 looks interesting, It might be caused by an old CUDA version (that doesn't support atomic with bfloat16). Would you mind sharing your device information and CUDA version?

LeiWang1999 avatar Mar 26 '25 12:03 LeiWang1999

@LeiWang1999 Sadly, my CUDA version is 12.1. Is it possible that the RTX Titan doesn't support the bfloat16 data type?

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

jiangwenj02 avatar Mar 27 '25 01:03 jiangwenj02

@jiangwenj02 We should determine whether the limitation is caused by the SM version or the CUDA version. Once we know, we can add a C macro to disable atomics with __nv_bfloat16.
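For reference, a minimal sketch of such a guard (an illustration only, keyed on __CUDA_ARCH__ rather than whatever macro the eventual patch uses, and with a hypothetical helper name; atomicAdd overloads for __nv_bfloat16 are only provided for compute capability 8.0 and newer, and the Titan RTX is sm_75):

#include <cuda_bf16.h>

// Sketch: expose the bf16 atomic helper only when compiling for sm_80 or
// newer, so Turing (sm_75) builds never instantiate the missing
// atomicAdd(__nv_bfloat16 *, __nv_bfloat16) overload.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
__device__ __forceinline__ void AtomicAddBF16(__nv_bfloat16 *address,
                                              __nv_bfloat16 val) {
  atomicAdd(address, val);
}
#endif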

LeiWang1999 avatar Mar 27 '25 02:03 LeiWang1999

I'm also interested in whether this can be reproduced with the quick-start GEMM example.

LeiWang1999 avatar Mar 27 '25 02:03 LeiWang1999

@LeiWang1999 To reproduce, I simply ran the examples without modifications. Here are my environment details:

sys.platform: linux
Python: 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6: NVIDIA TITAN RTX
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.66
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
PyTorch: 2.4.1+cu121
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201703
  • Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 12.1
  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  • CuDNN 90.1 (built against CUDA 12.4)
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

jiangwenj02 avatar Mar 27 '25 05:03 jiangwenj02

I made a fix, you can checkout the nightly build tonight or build from source #291

LeiWang1999 avatar Mar 27 '25 05:03 LeiWang1999

  1. We've noticed a discrepancy where the convolution example continues to use keys as arguments, while the autotune function implementation no longer appears to support this parameter.

  2. We found a few more issues in the modified common.h. Here are the fixes:

// Copyright (c) Tile-AI Corporation.
// Licensed under the MIT License.
#pragma once

#include <cuda_runtime.h>
#include <cutlass/fast_math.h>
#include <cutlass/numeric_types.h>
#include <math_constants.h>

using cutlass::bfloat16_t;
using cutlass::half_t;
using cutlass::tfloat32_t;

using int4_t = int4;

#define hexp cutlass::fast_exp
#define hlog cutlass::fast_log
#define hsqrt cutlass::fast_sqrt
#define htanh cutlass::fast_tanh
#define hpow powf

#define uint unsigned int
#define uchar unsigned char
#define ushort unsigned short

#define TL_DEVICE __forceinline__ __device__
#define TL_DEVICE_NOINLINE __noinline__ __device__

// Pack two half values.
TL_DEVICE unsigned __pack_half2(const half x, const half y) {
  unsigned v0 = *((unsigned short *)&x);
  unsigned v1 = *((unsigned short *)&y);
  return (v1 << 16) | v0;
}

// Pack two half_t values.
TL_DEVICE unsigned __pack_half2(const half_t x, const half_t y) {
  unsigned v0 = *((unsigned short *)&x);
  unsigned v1 = *((unsigned short *)&y);
  return (v1 << 16) | v0;
}

// Pack two bfloat16_t values.
TL_DEVICE unsigned __pack_half2(const bfloat16_t x, const bfloat16_t y) {
  unsigned v0 = *((unsigned short *)&x);
  unsigned v1 = *((unsigned short *)&y);
  return (v1 << 16) | v0;
}

// Pack two bfloat16_t values.
TL_DEVICE unsigned __pack_nv_bfloat162(const bfloat16_t x, const bfloat16_t y) {
  unsigned v0 = *((unsigned short *)&x);
  unsigned v1 = *((unsigned short *)&y);
  return (v1 << 16) | v0;
}

// Pack four char values
TL_DEVICE int make_int(signed char x0, signed char x1, signed char x2, signed char x3) {
  return (x3 << 24) | (x2 << 16) | (x1 << 8) | x0;
}

// Pack sixteen char values.
TL_DEVICE int4_t make_int4(signed char x0, signed char x1, signed char x2, signed char x3,
                           signed char y0, signed char y1, signed char y2, signed char y3,
                           signed char z0, signed char z1, signed char z2, signed char z3,
                           signed char w0, signed char w1, signed char w2, signed char w3) {
  int4_t result;
  result.x = make_int(x0, x1, x2, x3);
  result.y = make_int(y0, y1, y2, y3);
  result.z = make_int(z0, z1, z2, z3);
  result.w = make_int(w0, w1, w2, w3);
  return result;
}

// Helper to cast SMEM pointer to unsigned
TL_DEVICE uint32_t smem_ptr_to_uint(void const *const ptr) {
  return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
}

// Helper to cast SMEM pointer to unsigned
TL_DEVICE unsigned int cast_smem_ptr_to_int(const void *const smem_ptr) {
  unsigned int smem_int;
  asm volatile("{ .reg .u64 smem_int; cvta.to.shared.u64 smem_int, %1; "
               "cvt.u32.u64 %0, smem_int; }"
               : "=r"(smem_int)
               : "l"(smem_ptr));
  return smem_int;
}

template <typename T1, typename T2>
TL_DEVICE void AtomicAdd(T1 *address, T2 val) {
  atomicAdd(reinterpret_cast<T1 *>(address), static_cast<T1>(val));
}

// // AtomicAdd Functions for FP32
// TL_DEVICE void AtomicAdd(float *address, float val) {
//   atomicAdd(reinterpret_cast<float *>(address), val);
// }

// AtomicAdd Functions for FP16
template <> TL_DEVICE void AtomicAdd(half_t *address, half_t val) {
  // Use atomicCAS with built-in cuda_fp16 support
  atomicAdd(reinterpret_cast<half *>(address), static_cast<half>(val));
}

// AtomicAdd Functions for FP16
template <> TL_DEVICE void AtomicAdd(half_t *address, half_t *val) {
  atomicAdd(reinterpret_cast<half *>(address), static_cast<half>(*val));
}

// AtomicAdd Functions for FP16
template <> TL_DEVICE void AtomicAdd(half_t *address, float val) {
  // Use atomicCAS with built-in cuda_fp16 support
  atomicAdd(reinterpret_cast<half *>(address), __float2half(val));
}

#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ > 750))

// AtomicAdd Functions for BFLOAT16
template <> TL_DEVICE void AtomicAdd(bfloat16_t *address, bfloat16_t *val) {
  atomicAdd(reinterpret_cast<__nv_bfloat16 *>(address),
            static_cast<__nv_bfloat16>(*val));
}

#endif

#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ > 750))

// AtomicAdd Functions for BFLOAT16
template <> TL_DEVICE void AtomicAdd(bfloat16_t *address, float val) {
  atomicAdd(reinterpret_cast<__nv_bfloat16 *>(address), __float2bfloat16(val));
}

#endif

// AtomicAdd Functions for FP16x2
TL_DEVICE void AtomicAddx2(half_t *address, half_t *val) {
  atomicAdd(reinterpret_cast<half2 *>(address),
            static_cast<half2>(*reinterpret_cast<half2 *>(val)));
}

#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ > 750))

// AtomicAdd Functions for BFLOAT16
template <> TL_DEVICE void AtomicAdd(bfloat16_t *address, bfloat16_t val) {
  atomicAdd(reinterpret_cast<__nv_bfloat16 *>(address),
            static_cast<__nv_bfloat16>(val));
}

// AtomicAdd Functions for BFLOAT16x2
TL_DEVICE void AtomicAddx2(bfloat16_t *address, bfloat16_t *val) {
  atomicAdd(
      reinterpret_cast<__nv_bfloat162 *>(address),
      static_cast<__nv_bfloat162>(*reinterpret_cast<__nv_bfloat162 *>(val)));
}

#endif

// DP4A
template <typename InDatatype, typename OutDatatype>
TL_DEVICE void DP4A(InDatatype *a, InDatatype *b, OutDatatype *c) {
  const int a_int = *((int *)a);
  const int b_int = *((int *)b);
  const int c_int = *((int *)c);
  *c = __dp4a(a_int, b_int, c_int);
}

jiangwenj02 avatar Mar 28 '25 03:03 jiangwenj02

Yes, a similar fix using #if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ > 750)) was applied in #291.

You can install the nightly build with:

pip install tilelang -f https://tile-ai.github.io/whl/nightly/cu121/
# or
pip install tilelang --find-links https://tile-ai.github.io/whl/nightly/cu121/

Note: We’ve refactored the autotune system, but the example code hasn’t been updated to use the new autotune flow yet.

It would be great if you could send a pull request with the fix :). If not, we can have somebody offer some help.

LeiWang1999 avatar Mar 28 '25 03:03 LeiWang1999

@LeiWang1999 I have tested PR #291; the code I posted above is modified from it:

  1. The comparison operator should be changed from >= to > (see the sketch after this list).
  2. #291 needs two more #if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ > 750)) guards.
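For illustration, the intended guard condition would look like this (a sketch only; when compiling for the Titan RTX the architecture macro resolves to 750, so > compiles the bf16 atomics out, while >= would keep them and reproduce the overload error):

// With ">" the bfloat16 specializations are excluded on sm_75 (Turing);
// with ">=" they would still be compiled there and the error returns.
#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ > 750))
// ... bfloat16_t AtomicAdd / AtomicAddx2 specializations ...
#endif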

jiangwenj02 avatar Mar 28 '25 03:03 jiangwenj02

I see, thanks for the report! @jiangwenj02

LeiWang1999 avatar Mar 28 '25 03:03 LeiWang1999

@LeiWang1999 Hi, I tried different configurations of block_M, block_N, and block_K. I found that some configurations produced correct results, while others were wrong. However, when I looked at the code, I couldn't identify the error.

For example, an incorrect configuration:

program = convolution(  
    N, C, H, W, F, K, S, D, P, tune=args.tune)(  
        block_M=64, block_N=64, block_K=64, num_stages=4, threads=128)  

jiangwenj02 avatar Mar 28 '25 09:03 jiangwenj02

cc @Cunxiao2002 will take a look :)

LeiWang1999 avatar Mar 28 '25 16:03 LeiWang1999

@jiangwenj02 Hi, can you provide more information about your example and environment? I can't reproduce this error on an A100.

Cunxiao2002 avatar Mar 28 '25 16:03 Cunxiao2002

Also, are you using the latest nightly build? It’s possible that your stage count is too large, which could lead to insufficient shared memory. In the latest nightly build, this would trigger an error.
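As a rough sanity check, here is a back-of-the-envelope sketch (not tilelang's exact shared-memory accounting; it assumes each pipeline stage keeps one block_M x block_K A tile and one block_K x block_N B tile in fp16):

#include <cstdio>

int main() {
  // Hypothetical estimate of the dynamic shared memory a pipelined
  // GEMM-style kernel would need for the configuration reported above.
  const long block_M = 64, block_N = 64, block_K = 64, num_stages = 4;
  const long bytes = num_stages * (block_M + block_N) * block_K * 2;  // 2 bytes per fp16 element
  printf("~%ld bytes (%ld KB) of dynamic shared memory\n", bytes, bytes / 1024);
  // 4 * (64 + 64) * 64 * 2 = 65536 bytes = 64 KB, which already sits at the
  // per-block shared-memory ceiling Turing (sm_75) allows even after opting in
  // via cudaFuncAttributeMaxDynamicSharedMemorySize.
  return 0;
}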

LeiWang1999 avatar Mar 28 '25 16:03 LeiWang1999

@Cunxiao2002 @LeiWang1999 I tried reducing num_stages to 1, but it still has mismatched elements.

Traceback (most recent call last):
  File "/root/resattn/dyconv/others/test_tilelang_v2.py", line 244, in <module>
    profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
  File "/root/anaconda3/envs/openseed/lib/python3.10/site-packages/tilelang/profiler/__init__.py", line 107, in assert_allclose
    torch_assert_close(
  File "/root/anaconda3/envs/openseed/lib/python3.10/site-packages/tilelang/utils/tensor.py", line 287, in torch_assert_close
    raise AssertionError(
AssertionError: Too many mismatched elements: 1077889 > 20971 (1.00% allowed, but get 51.40%).
First mismatch at index [0, 0, 0, 16]: lhs=495.000000, rhs=519.500000, abs_diff=24.500000, rel_diff=0.047161

The settings are as follows:

program = convolution(
            N, C, H, W, F, K, S, D, P, tune=args.tune)(
                block_M=64, block_N=64, block_K=64, num_stages=1, threads=128)

jiangwenj02 avatar Mar 31 '25 06:03 jiangwenj02

@jiangwenj02 CC @Cunxiao2002, we likely need to reproduce it on a V100.

LeiWang1999 avatar Mar 31 '25 06:03 LeiWang1999

It would be better to provide the entire script to help us reproduce the issue.

LeiWang1999 avatar Nov 05 '25 12:11 LeiWang1999