Haicheng Wu
splitK is for large K and small M/N: it creates enough work to saturate the GPU. sliceK is used to reduce the shared memory traffic when the tile size is small.
> what is "shared memory traffic"? Loading from and storing to shared memory.
Which TF32 example do you use? TF32 is a new Ampere feature; it is supposed to use the multi-stage mainloop, which requires >= 3 stages.
It is impossible for tf32 to have higher accuracy than fp64. It is likely that your code has a bug. You can take a look at https://github.com/NVIDIA/cutlass/tree/master/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm. It compares the accuracy of tf32x1...
I made the change below to example 27 to calculate the accuracy of tf32 vs fp64
```diff
diff --git a/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu b/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu
index 06559637..fe1e7bb7 100644
--- a/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu
+++ b/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu
@@ -115,18...
```
```
/home/cydoroga/arc/arcadia/junk/cydoroga/dynamic_parallelism/test.cu(82): error: no instance of constructor "cutlass::gemm::kernel::Gemm::Params::Params
```
Why are there two `Params` at the end?
cutlass `GemmUniversal` supports both modes of batched gemm. See https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm_universal.h Set `mode` [here](https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/gemm_universal.h#L110) to `kBatched` or `kArray` (https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/gemm.h#L407-L408). This [example](https://github.com/NVIDIA/cutlass/tree/master/examples/36_gather_scatter_fusion) can gather inputs for a GEMM. It loads the row...
What kind of kernels do you need? GEMM or CONV, data types, tensor cores or not, architectures, layouts of inputs and outputs, any kernel fusion? CUTLASS provides lots of modules...
We have no intention to support small alignment in b2b gemm. It is not a good idea to apply b2b to inputs that are not well aligned. Padding matrices is easy and...