Haicheng Wu

Results 323 comments of Haicheng Wu

> Is b1 x b1 GEMMs all implemented by XOR that requires uint1_t x uint1_t ?

We support both `xor_popc` and `and_popc`. See this one https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/mma_sm80.h#L2017 and this one https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/mma_sm80.h#L2084...
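For readers unfamiliar with the two 1-bit modes, here is a minimal host-side sketch of what an XOR + popcount and an AND + popcount accumulation compute over bit-packed operands. It is not CUTLASS code; the function names `popcount32`, `dot_xor_popc`, and `dot_and_popc` are hypothetical and exist only for illustration, and the packing (32 one-bit elements per `uint32_t`) is an assumption.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count the set bits of a 32-bit word (portable, no compiler intrinsics).
inline int popcount32(std::uint32_t x) {
    return static_cast<int>(std::bitset<32>(x).count());
}

// Hypothetical reference: accumulate popc(a ^ b) over bit-packed words.
// Each word is assumed to hold 32 one-bit elements.
int dot_xor_popc(const std::vector<std::uint32_t>& a,
                 const std::vector<std::uint32_t>& b) {
    assert(a.size() == b.size());
    int acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += popcount32(a[i] ^ b[i]);  // counts positions where bits differ
    }
    return acc;
}

// Hypothetical reference: accumulate popc(a & b) over bit-packed words.
int dot_and_popc(const std::vector<std::uint32_t>& a,
                 const std::vector<std::uint32_t>& b) {
    assert(a.size() == b.size());
    int acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += popcount32(a[i] & b[i]);  // counts positions where both bits are 1
    }
    return acc;
}
```

The XOR form counts mismatching bit positions, while the AND form counts positions where both operands are 1, so the two modes give different accumulations for the same packed inputs.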

Maybe it is a question for the paper author. I don't know how to pack a sign bit and a data bit into 1 bit. XORing with the sign bit also sounds strange...

We have many f64 or complex f64 unit tests here: https://github.com/NVIDIA/cutlass/tree/master/test/unit/gemm/device

`device::gemm_universal` and `device::gemm` are the same if you don't run splitK or batched gemm. You can dump all the arguments to verify that both cases are running the same problem....
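To illustrate why the two paths should compute the same problem when split-K is not used, here is a minimal host-side reference sketch, not CUTLASS code. The function `gemm_split_k` and its row-major layout are assumptions made for illustration: K is divided into `slices` chunks whose partial sums are reduced into C, and with `slices == 1` it degenerates to a plain GEMM loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major reference: C (MxN) = A (MxK) * B (KxN),
// with the K dimension divided into `slices` chunks (split-K style).
std::vector<double> gemm_split_k(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 int M, int N, int K, int slices) {
    assert(slices >= 1);
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);

    std::vector<double> C(static_cast<std::size_t>(M) * N, 0.0);
    const int chunk = (K + slices - 1) / slices;  // ceil(K / slices)

    for (int s = 0; s < slices; ++s) {
        const int k_begin = s * chunk;
        const int k_end = std::min(K, k_begin + chunk);
        for (int m = 0; m < M; ++m) {
            for (int n = 0; n < N; ++n) {
                double partial = 0.0;
                for (int k = k_begin; k < k_end; ++k) {
                    partial += A[static_cast<std::size_t>(m) * K + k] *
                               B[static_cast<std::size_t>(k) * N + n];
                }
                // Reduce this slice's partial sum into the output.
                C[static_cast<std::size_t>(m) * N + n] += partial;
            }
        }
    }
    return C;
}
```

With floating-point inputs, more than one slice changes the summation order, so results can differ in the last bits; with a single slice the loop order matches an ordinary GEMM.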

Have you lowered the frequency? Do you have one or two warmup runs? The CUTLASS profiler has a warmup run. Usually the warmup run is much slower than the follow-up runs.
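A minimal sketch of a timing loop that discards warmup runs, assuming a generic callable. The helper `time_with_warmup` is hypothetical and is not how the CUTLASS profiler is implemented. When timing GPU kernels, the callable would also need to synchronize (or the harness would use CUDA events), because kernel launches are asynchronous.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: run `fn` `warmup` times without timing,
// then time `iters` runs and return the per-run durations in milliseconds.
template <typename Fn>
std::vector<double> time_with_warmup(Fn&& fn, int warmup, int iters) {
    assert(warmup >= 0 && iters >= 0);

    // Warmup runs absorb one-time costs and are not measured.
    for (int i = 0; i < warmup; ++i) {
        fn();
    }

    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(iters));
    for (int i = 0; i < iters; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();  // for GPU work, fn must block until the kernel has finished
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}
```

Comparing two kernels is only meaningful if both are measured the same way, with the same number of warmup and timed iterations.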

Then I really have no idea. Here is the code of `device::gemm`: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/gemm.h#L187-L352
Here is the code of `device::gemm_universal`: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/gemm_universal.h
The only difference that you can see is that...

Someone also reported earlier on GitHub that changing the T4 driver has a big impact on performance. Maybe you can try that too.

Also, the CUTLASS nvcc command line looks like this:

```
nvcc -DCUTLASS_ENABLE_CUBLAS=1 -DCUTLASS_NAMESPACE=cutlass -I/home/scratch.haichengw_gpu/cutlass_public/include -I/home/scratch.haichengw_gpu/cutlass_public/examples/common -I/home/scratch.haichengw_gpu/cutlass_public/build/include -I/home/scratch.haichengw_gpu/cutlass_public/tools/util/include -O3 -DNDEBUG -Xcompiler=-fPIE -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_75,code=[sm_75,compute_75] -std=c++11 -x cu...
```

Depthwise conv will remain a SIMT kernel in CUTLASS 2.11, but the perf will be much better.