llm-serving topic
torchpipe
Serving inside PyTorch
sglang
SGLang is a fast serving framework for large language models and vision language models.
swiftLLM
A tiny yet powerful LLM inference system tailored for research purposes. Achieves vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
Awesome-LLMs-ICLR-24
A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.
Nanoflow
A throughput-oriented high-performance serving framework for LLMs
Z1
[EMNLP'2025 Industry] Repo for "Z1: Efficient Test-time Scaling with Code"
embeddedllm
EmbeddedLLM: API server for embedded device deployment. Currently supports CUDA/OpenVINO/IpexLLM/DirectML/CPU.
gpustack
GPU cluster manager for optimized AI model deployment
kvcached
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
blackbird
A high-performance RDMA distributed file system for fast LLM inference and GPU training