cutlass issues

[FEA] Add Prefetching Hints Support for Global Memory Loading

6

**Is your feature request related to a problem? Please describe.** CUTLASS has integrated Level 2 (L2) prefetch hints for global-memory-to-register loads (LDG) in its implementation (code at https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/arch/memory.h#L161)....

reed-lau

feature request

inactive-30d

inactive-90d

[QST] What is the definition and difference of two partition methods in cute?

6

```
TiledMMA tiled_mma;
auto thr_mma = tiled_mma.get_slice(threadIdx.x);
auto tAgA = thr_mma.partition_A(gA);  // (MMA, MMA_M, MMA_K, num_tile_k)
auto tBgB = thr_mma.partition_B(gB);  // (MMA, MMA_N, MMA_K, num_tile_k)
auto tCgC = thr_mma.partition_C(gC);  //...
```

ziyuhuang123

question

? - Needs Triage

inactive-30d

CuTe

[BUG] Copy_Atom with DefaultCopy cause misaligned address

4

**Describe the bug** As of b7508e337938137a699e486d8997646980acfc58, `Copy_Atom` causes a misaligned address error. **Steps/Code to reproduce bug**
```cuda
#include
using namespace cute;
__global__ void kernel(int m, int k, float* a, int lda) {...
```

cloudhan

bug

inactive-30d

CuTe

[QST] Is there any INT8 GEMM with INT8 alpha and beta?

8

**What is your question?**
```
using Gemm = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,
    int8_t, cutlass::layout::ColumnMajor,
    ElementOutput, cutlass::layout::RowMajor,
    ElementAccumulator,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm75>;

cutlass::gemm::GemmCoord problem_size(M, N, K);
cutlass::TensorRef input_ref(input.data_ptr(), LayoutInputA::packed(input_size));
cutlass::TensorRef weight_ref(weight.data_ptr(), LayoutInputB::packed(weight_size));
cutlass::TensorRef out_ref(out.data_ptr(),...
```

jhss

question

inactive-30d

[QST] Question about "global_load" code in "cutlass/arch/memory.h"

4

The `global_load` function copies data from global memory into a local array (possibly held in registers). For example:
```c++
template struct global_load {
  CUTLASS_DEVICE
  global_load(AccessType &D, void const *ptr,...
```
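To make the copy semantics described in this question concrete, here is a minimal host-side sketch in plain C++. It is not CUTLASS code: the name `guarded_load_sketch` and the `pred` parameter are assumptions introduced for illustration, since the signature in the preview above is truncated. It only shows the general idea of copying one access-sized chunk from a source pointer into a local object when a guard is true.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Hypothetical host-side analog of a guarded load (not the CUTLASS API).
// Copies sizeof(AccessType) bytes from `ptr` into `dst` when `pred` is true;
// when `pred` is false, `dst` is left unchanged.
template <typename AccessType>
void guarded_load_sketch(AccessType &dst, void const *ptr, bool pred) {
  if (pred) {
    std::memcpy(&dst, ptr, sizeof(AccessType));
  }
}
```

Calling `guarded_load_sketch(dst, src.data(), true)` with `dst` and `src` both of type `std::array<float, 4>` copies all four floats in one call; with `false` the destination keeps its previous contents. On the device, the real function does this with inline PTX rather than `memcpy`, and those details are outside this sketch.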

bmmcq

question

inactive-30d

[QST] how bank conflict in shared memory is fixed in depthwise conv

6

**What is your question?** Bank conflicts play an extremely important role in shared memory (smem) performance. How are they resolved in depthwise conv? @Ethan-Yan27

yupatrick22

question

inactive-30d

[QST] example 13_two_tensor_op_fusion, what to do with no relu

3

Hello, I just found the example 13_two_tensor_op_fusion, and I'm trying to adapt it for general-purpose use. I want to compute Z=X@W1@W2, so I tried to replace [this line](https://github.com/NVIDIA/cutlass/blob/30ec1a464961df8e511a1562976a7a8527e1e554/examples/13_two_tensor_op_fusion/fused_two_gemms_grouped_f16_sm80_rf.cu#L201)...

harryhan618

question

inactive-30d

[BUG] tensor.hpp must be included before mma_atom.hpp

9

**Describe the bug** I find that `tensor.hpp` must be included before `mma_atom.hpp`; otherwise the code does not compile. **Steps/Code to reproduce bug** Compile with `nvcc test.cu -I include/ -std=c++17`
```c++
#include //...
```

mammoth831

bug

[QST] Any experience to help me optimize on a particular problem size(a small M)?

5

I am new to CUTLASS and have reviewed many related documents and examples; I also have a general understanding of ThreadBlockShape, WarpShape, InstructionShape, and NumStages. In my current problem domain,...

HuaYZhao

question

[QST] Is this the complete set of valid parameters for performing fp16 matrix multiplication using tensor cores?

12

**What is your question?** In this [website](https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/gemm_f16n_f16n_f32n_tensor_op_f32_sm80.cu), there are many parameters. Are the parameters listed on this page the complete set of valid ones?

zwshan

question
