Haicheng Wu
splitK is for large K and small M/N: it creates enough work to saturate the GPU. sliceK is used to reduce the shared memory traffic when the tile size is small.
> what is "shared memory traffic"? Loading from and storing to shared memory.
Which TF32 example do you use? TF32 is a new Ampere feature; it is supposed to use the multi-stage mainloop, which requires >= 3 stages.
It is impossible for tf32 to have higher accuracy than fp64. It is likely that your code has a bug. You can take a look at https://github.com/NVIDIA/cutlass/tree/master/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm. It compares the accuracy of tf32x1...
I made the change below to example 27 to calculate the accuracy of tf32 vs fp64
```diff
diff --git a/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu b/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu
index 06559637..fe1e7bb7 100644
--- a/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu
+++ b/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu
@@ -115,18...
```
```
/home/cydoroga/arc/arcadia/junk/cydoroga/dynamic_parallelism/test.cu(82): error: no instance of constructor "cutlass::gemm::kernel::Gemm::Params::Params
```
Why are there two `Params` at the end?
cutlass `GemmUniversal` supports both modes of batched gemm. See https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm_universal.h Set `mode` [here](https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/gemm_universal.h#L110) to `kBatched` or `kArray` (https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/gemm.h#L407-L408). This [example](https://github.com/NVIDIA/cutlass/tree/master/examples/36_gather_scatter_fusion) can gather inputs for a GEMM. It loads the row...
What kind of kernels do you need? GEMM or CONV, data types, tensor cores or not, architectures, layouts of inputs and outputs, any kernel fusion? CUTLASS provides lots of modules...
We have no intention to support small alignment in b2b gemm. It is not a good idea to apply b2b to inputs that are not well aligned. Padding matrices is easy and...