Bruce-Lee-LY

6 repositories owned by Bruce-Lee-LY

cuda_hgemm

270
Stars
62
Forks

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

cuda_hook

129
Stars
33
Forks

Hooks CUDA-related dynamic libraries using automated code generation tools.

cuda_hgemv

48
Stars
4
Forks

Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

flash_attention_inference

20
Stars
2
Forks

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

decoding_attention

46
Stars
4
Forks
46
Watchers

Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.

matrix_multiply

15
Stars
3
Forks
15
Watchers

Several common matrix multiplication methods implemented on CPU and NVIDIA GPU using C++11 and CUDA.