cutlass
CUDA Templates for Linear Algebra Subroutines
**Is your feature request related to a problem? Please describe.** CUTLASS has integrated Level 2 (L2) prefetch hints for global-memory-to-register loads (LDG) in its implementation (code at https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/arch/memory.h#L161)....
``` TiledMMA tiled_mma; auto thr_mma = tiled_mma.get_slice(threadIdx.x); auto tAgA = thr_mma.partition_A(gA); // (MMA, MMA_M, MMA_K, num_tile_k) auto tBgB = thr_mma.partition_B(gB); // (MMA, MMA_N, MMA_K, num_tile_k) auto tCgC = thr_mma.partition_C(gC); //...
**Describe the bug** As of b7508e337938137a699e486d8997646980acfc58, `Copy_Atom` causes a misaligned address error. **Steps/Code to reproduce bug** ```cuda #include using namespace cute; __global__ void kernel(int m, int k, float* a, int lda) {...
**What is your question?** ``` using Gemm = cutlass::gemm::device::Gemm< int8_t, cutlass::layout::RowMajor, int8_t, cutlass::layout::ColumnMajor, ElementOutput, cutlass::layout::RowMajor, ElementAccumulator, cutlass::arch::OpClassTensorOp, cutlass::arch::Sm75>; cutlass::gemm::GemmCoord problem_size(M, N, K); cutlass::TensorRef input_ref(input.data_ptr(), LayoutInputA::packed(input_size)); cutlass::TensorRef weight_ref(weight.data_ptr(), LayoutInputB::packed(weight_size)); cutlass::TensorRef out_ref(out.data_ptr(),...
The "global_load" function is used to copy data from global memory to a local array (possibly in registers). For example: ```c++ template struct global_load { CUTLASS_DEVICE global_load(AccessType &D, void const *ptr,...
**What is your question?** Bank conflicts play an extremely important role in shared memory (smem) performance. How are they handled in depthwise conv? @Ethan-Yan27
Hello, I just found the example 13_two_tensor_op_fusion, and I am trying to adapt it for general-purpose use. I want to compute Z=X@W1@W2, so I tried to replace [this line](https://github.com/NVIDIA/cutlass/blob/30ec1a464961df8e511a1562976a7a8527e1e554/examples/13_two_tensor_op_fusion/fused_two_gemms_grouped_f16_sm80_rf.cu#L201)...
**Describe the bug** I found that `` must be included before ``; otherwise, compilation fails. **Steps/Code to reproduce bug** Compile with `nvcc test.cu -I include/ -std=c++17` ```c++ #include //...
I am a beginner with CUTLASS and have reviewed many related documents and examples. I also have a general understanding of ThreadBlockShape, WarpShape, InstructionShape, and NumStages. In my current problem domain,...
**What is your question?** This [test file](https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/gemm_f16n_f16n_f32n_tensor_op_f32_sm80.cu) lists many parameter combinations. Are the ones listed there all of the valid ones?