flash-attention topics

📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

DefTruth

awq

ee-llm

flash-attention

flash-attention-2

Qwen

14.4k Stars · 1.2k Forks

The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.

QwenLM

chinese

flash-attention

large-language-models

llm

Chinese-LLaMA-Alpaca-2

7.1k Stars · 578 Forks

Phase two of the Chinese LLaMA-2 & Alpaca-2 LLM project, plus 64K long-context models.

ymcui

64k

alpaca

alpaca-2

alpaca2

gdGPT

91 Stars · 8 Forks

Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Described by the author as faster than ZeRO/ZeRO++/FSDP.

CoinCheung

baichuan2-7b

bloom

chatglm3-6b

deepspeed

CUDA-Learn-Notes

1.2k Stars · 133 Forks

🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.

DefTruth

cuda

cuda-kernels

cuda-programming

block-reduce