gemm topic
CTranslate2
Fast inference engine for Transformer models
laser
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats a...
blislab
BLISlab: A Sandbox for Optimizing GEMM
Tensile
Stretching GPU performance for GEMMs and tensor contractions.
Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to near-cuBLAS performance.
dbcsr
DBCSR: Distributed Block Compressed Sparse Row matrix library
cublasgemm-benchmark
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm