Haicheng Wu comments

Results: 323 comments of Haicheng Wu

[QST] Any support or examples of uint1_t x int1_t GEMM?

> Is b1 x b1 GEMMs all implemented by XOR that requires uint1_t x uint1_t ?

We support both `xor_popc` and `and_popc`. See this one https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/mma_sm80.h#L2017 and this one https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/mma_sm80.h#L2084...
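To make the difference between the two 1-bit modes concrete, here is a minimal host-side sketch of what an XOR-popcount and an AND-popcount inner product compute over bit-packed operands. It is plain C++, not CUTLASS code: the function names `popcount32`, `dot_xor_popc`, and `dot_and_popc` are hypothetical, and packing 32 one-bit elements per `uint32_t` word is an assumption made for illustration.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Portable population count of one 32-bit word (hypothetical helper).
inline int popcount32(std::uint32_t x) {
    return static_cast<int>(std::bitset<32>(x).count());
}

// XOR-popcount inner product: counts the bit positions where a and b differ.
int dot_xor_popc(const std::uint32_t* a, const std::uint32_t* b, std::size_t words) {
    int acc = 0;
    for (std::size_t i = 0; i < words; ++i) {
        acc += popcount32(a[i] ^ b[i]);
    }
    return acc;
}

// AND-popcount inner product: counts the bit positions where both a and b are 1.
// This equals the ordinary dot product when each bit is read as 0 or 1.
int dot_and_popc(const std::uint32_t* a, const std::uint32_t* b, std::size_t words) {
    int acc = 0;
    for (std::size_t i = 0; i < words; ++i) {
        acc += popcount32(a[i] & b[i]);
    }
    return acc;
}
```

For example, with `a = 0xC` (binary 1100) and `b = 0xA` (binary 1010), the XOR form gives 2 and the AND form gives 1. Which form fits depends on how the bits are meant to be interpreted.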

[QST] Any support or examples of uint1_t x int1_t GEMM?

Maybe it is a question for the paper author. I don't know how to pack a sign bit and a data bit into 1 bit. XORing with the sign bit also sounds strange...

[QST] A100 double-precision Tensor Cores ?

We have many f64 or complex f64 unit tests here: https://github.com/NVIDIA/cutlass/tree/master/test/unit/gemm/device

[QST] Profiling difference between GemmUinversal and Gemm?

`device::gemm_universal` and `device::gemm` are the same if you don't run splitK or batched gemm. You can dump all the arguments to verify that both cases are running the same problem....
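As a rough illustration of why the two should coincide when split-K is not used, the sketch below compares a reference GEMM with a split-K version on the host. It is plain C++ and does not use the CUTLASS API: `gemm_reference` and `gemm_split_k` are hypothetical names, and the row-major layout is an assumption. With a single slice, the split-K loop runs over the whole K range once and reduces to the reference computation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reference row-major GEMM: C = A * B, with A of size MxK and B of size KxN.
std::vector<float> gemm_reference(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n)
            for (std::size_t k = 0; k < K; ++k)
                C[m * N + n] += A[m * K + k] * B[k * N + n];
    return C;
}

// Split-K GEMM: partition the K dimension into slices, compute one partial
// product per slice, then sum the partial products into C.
std::vector<float> gemm_split_k(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t M, std::size_t N, std::size_t K,
                                std::size_t slices) {
    std::vector<float> C(M * N, 0.0f);
    if (slices == 0) slices = 1;  // treat 0 as "no split"
    const std::size_t chunk = (K + slices - 1) / slices;
    for (std::size_t s = 0; s < slices; ++s) {
        const std::size_t k_begin = s * chunk;
        if (k_begin >= K) break;
        const std::size_t k_end = std::min(K, k_begin + chunk);
        std::vector<float> partial(M * N, 0.0f);
        for (std::size_t m = 0; m < M; ++m)
            for (std::size_t n = 0; n < N; ++n)
                for (std::size_t k = k_begin; k < k_end; ++k)
                    partial[m * N + n] += A[m * K + k] * B[k * N + n];
        for (std::size_t i = 0; i < C.size(); ++i) C[i] += partial[i];
    }
    return C;
}
```

Calling `gemm_split_k(A, B, M, N, K, 1)` performs the same accumulation as `gemm_reference(A, B, M, N, K)`; with more slices the floating-point summation order changes, so results may differ slightly for non-integer data.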

[QST] Profiling difference between GemmUinversal and Gemm?

Have you lowered the frequency? Do you have one or two warmup runs? The CUTLASS profiler has a warmup run. Usually the warmup run is much slower than the follow-up runs.
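A minimal sketch of a timing loop that discards warmup runs before measuring, in plain C++. The helper name `average_ms_after_warmup` is hypothetical and this is not how the CUTLASS profiler is implemented. It assumes `run` is synchronous; when timing a GPU kernel, `run` would also need to wait for the device to finish (for example with `cudaDeviceSynchronize()`), otherwise only the launch is timed.

```cpp
#include <chrono>
#include <functional>

// Runs `run` `warmup` times without timing, then returns the average
// wall-clock time in milliseconds over `iters` timed runs.
double average_ms_after_warmup(const std::function<void()>& run,
                               int warmup, int iters) {
    // Warmup runs absorb one-time costs (lazy initialization, cold caches,
    // clock ramp-up) so they do not skew the measurement.
    for (int i = 0; i < warmup; ++i) run();

    if (iters <= 0) return 0.0;

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) run();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iters;
}
```

Comparing two kernels with the same `warmup` and `iters` values keeps the comparison fair, since both then exclude the slow first runs in the same way.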

[QST] Profiling difference between GemmUinversal and Gemm?

Then I really have no idea. Here is the code of `device::gemm`: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/gemm.h#L187-L352 Here is the code of `device::gemm_universal`: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/gemm_universal.h The only difference that you can see is that...

[QST] Profiling difference between GemmUinversal and Gemm?

Someone also reported on GitHub before that changing the T4 driver has a big impact on performance. Maybe you can try that too.

[QST] Profiling difference between GemmUinversal and Gemm?

Also, the CUTLASS nvcc command line looks like this:

```
nvcc -DCUTLASS_ENABLE_CUBLAS=1 -DCUTLASS_NAMESPACE=cutlass -I/home/scratch.haichengw_gpu/cutlass_public/include -I/home/scratch.haichengw_gpu/cutlass_public/examples/common -I/home/scratch.haichengw_gpu/cutlass_public/build/include -I/home/scratch.haichengw_gpu/cutlass_public/tools/util/include -O3 -DNDEBUG -Xcompiler=-fPIE -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_75,code=[sm_75,compute_75] -std=c++11 -x cu...
```

[BUG] CUDA Error CUresult.CUDA_ERROR_ILLEGAL_ADDRESS when using cutlass_tensorop_s1688tf32gemm op

CUTLASS 2.10 reimplemented PyCUTLASS. Please give it a try.

ops.conv2d(group=256) outputs NaN and Inf

Depthwise conv will remain a SIMT kernel in CUTLASS 2.11, but the perf will be much better.