cutlass
CUDA Templates for Linear Algebra Subroutines
I see in the [discussion](https://github.com/NVIDIA/cutlass/discussions/427) about the GTC talk (S41606) that you have developed a useful code-gen script; however, I did not find it in the repo. Would you please tell me where I can...
I am trying to load offline-compiled PTX at runtime from the same CUDA source file and launch the kernel using cuLaunchKernel, but examples/16_ampere_tensorop_conv2dfprop fails with driver error code 1. ``` >...
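For reference, a minimal sketch of the driver-API load-and-launch path this question describes, with `kernel.ptx` and `my_kernel` as hypothetical placeholder names; error code 1 is `CUDA_ERROR_INVALID_VALUE`, which often traces back to kernel arguments or a launch configuration that does not match the compiled kernel:

```cpp
#include <cuda.h>
#include <cstdio>

// Minimal driver-API sketch, not specific to any CUTLASS example:
// load offline-compiled PTX at runtime and launch one kernel.
#define CHECK(call)                                          \
  do {                                                       \
    CUresult err = (call);                                   \
    if (err != CUDA_SUCCESS) {                               \
      const char* msg = nullptr;                             \
      cuGetErrorString(err, &msg);                           \
      std::printf("driver error %d: %s\n", (int)err, msg);   \
      return 1;                                              \
    }                                                        \
  } while (0)

int main() {
  CHECK(cuInit(0));
  CUdevice dev;
  CHECK(cuDeviceGet(&dev, 0));
  CUcontext ctx;
  CHECK(cuCtxCreate(&ctx, 0, dev));

  CUmodule mod;
  CHECK(cuModuleLoad(&mod, "kernel.ptx"));   // offline-compiled PTX (placeholder name)
  CUfunction fn;
  CHECK(cuModuleGetFunction(&fn, mod, "my_kernel"));

  int n = 256;
  CUdeviceptr buf;
  CHECK(cuMemAlloc(&buf, n * sizeof(float)));
  void* args[] = { &buf, &n };               // must match the kernel signature exactly

  // Note: large CUTLASS kernels often need dynamic shared memory above the
  // default 48 KB limit, which requires cuFuncSetAttribute with
  // CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES before launch;
  // omitting that is a common cause of driver-API launch failures.
  CHECK(cuLaunchKernel(fn, /*grid*/ 1, 1, 1, /*block*/ 256, 1, 1,
                       /*sharedMemBytes*/ 0, /*stream*/ nullptr, args, nullptr));
  CHECK(cuCtxSynchronize());

  CHECK(cuMemFree(buf));
  CHECK(cuModuleUnload(mod));
  CHECK(cuCtxDestroy(ctx));
  return 0;
}
```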
Hello, I would like to implement a custom PyTorch kernel using the CUTLASS 2D convolution. I saw that you released Python scripts in release 2.9 to launch a GEMM kernel...
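As one possible starting point, a minimal sketch of wiring a CUTLASS 2.x device-level GEMM into a PyTorch C++/CUDA extension; the conv2d path is analogous via `cutlass::conv::device::ImplicitGemmConvolution` (see examples/16). The function name `cutlass_gemm` and the default fp32 row-major configuration are assumptions for illustration:

```cpp
// cutlass_ext.cu -- hypothetical PyTorch extension sketch, assuming
// contiguous row-major fp32 tensors and CUTLASS's default SIMT kernel.
#include <torch/extension.h>
#include <cutlass/gemm/device/gemm.h>

torch::Tensor cutlass_gemm(torch::Tensor A, torch::Tensor B) {
  TORCH_CHECK(A.is_cuda() && B.is_cuda(), "inputs must be CUDA tensors");
  TORCH_CHECK(A.dtype() == torch::kFloat32, "sketch assumes fp32");
  int M = A.size(0), K = A.size(1), N = B.size(1);
  auto D = torch::empty({M, N}, A.options());

  using Gemm = cutlass::gemm::device::Gemm<
      float, cutlass::layout::RowMajor,   // A
      float, cutlass::layout::RowMajor,   // B
      float, cutlass::layout::RowMajor>;  // C/D

  // D = alpha * A @ B + beta * C; here C aliases D and beta = 0.
  // (For brevity this launches on the default stream; a real extension
  // should pass PyTorch's current CUDA stream to the operator.)
  Gemm gemm_op;
  cutlass::Status status = gemm_op({
      {M, N, K},
      {A.data_ptr<float>(), K},   // lda = K for row-major A
      {B.data_ptr<float>(), N},   // ldb
      {D.data_ptr<float>(), N},   // ldc (source C)
      {D.data_ptr<float>(), N},   // ldd (destination D)
      {1.0f, 0.0f}                // alpha, beta
  });
  TORCH_CHECK(status == cutlass::Status::kSuccess, "CUTLASS GEMM failed");
  return D;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("gemm", &cutlass_gemm, "CUTLASS fp32 GEMM (sketch)");
}
```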
Also, for this [case](https://github.com/NVIDIA/cutlass/blob/master/examples/13_two_tensor_op_fusion/b2b_gemm_f16t_f16n_f16t_tensor_op_f16_sm75.h#L44), I tried some other parameters to verify the result, such as `cutlass::gemm::GemmCoord gemm_f16_sm75_problem_size_0(10, 64, 576); cutlass::gemm::GemmCoord gemm_f16_sm75_problem_size_1(10, 128, 64);`, and it runs OK, ...
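For context, the back-to-back fusion feeds GEMM0's output D0 into GEMM1 as its A operand, so the two `GemmCoord` problem sizes must line up: M must match, and GEMM1's K must equal GEMM0's N (the example additionally constrains N0 to the first kernel's threadblock tile N, which is why arbitrary sizes can fail). A minimal sketch of that constraint, using the sizes quoted above:

```cpp
#include <cassert>
#include <cutlass/gemm_coord.h>

int main() {
  // GemmCoord is (M, N, K). In the b2b fusion, GEMM0's output D0 (M0 x N0)
  // becomes GEMM1's A operand, so M1 == M0 and K1 == N0 must hold.
  cutlass::gemm::GemmCoord problem_size_0(10, 64, 576);   // D0 is 10 x 64
  cutlass::gemm::GemmCoord problem_size_1(10, 128, 64);   // consumes 10 x 64

  assert(problem_size_1.m() == problem_size_0.m());
  assert(problem_size_1.k() == problem_size_0.n());
  return 0;
}
```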
Test code for this version can be found in `examples/37_gemm_layernorm_gemm_fusion/gemm_layernorm_bias_residual.cu`. Things that need to be modified are marked as `TODO`.
This is the original gemm_universal_with_broadcast PR, written in April. The added unit test test/unit/gemm/device/gemm_broadcast_test.cu passed at that time, but it no longer passes.
**Describe the bug** I am trying to do a GEMM between two fp32 arrays using the Python API to produce an fp32 output. I would like to leverage tensor cores...
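On the C++ side, fp32 operands reach the tensor cores through the TF32 path on SM80 and newer; a minimal sketch, assuming CUTLASS's default tensor-op configuration for float operands on Sm80 (the Python API exposes the corresponding choice through its math-operation settings):

```cpp
#include <cutlass/gemm/device/gemm.h>

// Sketch: an fp32 GEMM routed through tensor cores. On SM80+, CUTLASS
// realizes float tensor-op GEMMs with TF32 instructions: inputs are
// rounded to TF32 internally while accumulation stays in fp32.
using GemmTf32 = cutlass::gemm::device::Gemm<
    float, cutlass::layout::RowMajor,    // A
    float, cutlass::layout::RowMajor,    // B
    float, cutlass::layout::RowMajor,    // C/D
    float,                               // accumulator
    cutlass::arch::OpClassTensorOp,      // use tensor cores...
    cutlass::arch::Sm80>;                // ...on Ampere or newer

cutlass::Status run_tf32(int M, int N, int K,
                         float const* A, float const* B,
                         float const* C, float* D) {
  GemmTf32 gemm_op;
  return gemm_op({{M, N, K},
                  {A, K}, {B, N}, {C, N}, {D, N},   // row-major leading dims
                  {1.0f, 0.0f}});                   // alpha, beta
}
```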
I ran example 57_hopper_grouped_gemm with different options and found that performance degrades when beta != 0. For example, run the following command: `./examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm --m=5120 --n=1280 --k=256...`
Referring to #1316, I have tried the 55th example, 55_hopper_mixed_dtype_gemm. It works fine for w4a8 with groupsize=128, which includes changes from the baseline like `using MmaType = int8_t;` and `using ElementC = int32_t;`...
Hi everybody, I'm currently trying to write a trainer for a very small and oddly shaped network which requires a lot of gather/scatter. E.g. one layer looks like this: C...