| Paper | Keywords | Affiliation | Venue | Code |
| --- | --- | --- | --- | --- |
| Fast Distributed Inference Serving for Large Language Models | Distributed inference serving | PKU | arXiv | |
| AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving | Pipeline parallelism; Auto parallelism | UCB | OSDI 2023 | GitHub repo |
| Orca: A Distributed Serving System for Transformer-Based Generative Models | Continuous batching | Seoul National University | OSDI 2022 | |
| Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads | Multiple decoding heads | Princeton University | arXiv | GitHub repo |
| PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | Consumer-grade GPU | SJTU | arXiv | GitHub repo |
| LLM in a flash: Efficient Large Language Model Inference with Limited Memory | Flash; Pruning | Apple | arXiv | |
| Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline | Length perception | NUS | NeurIPS 2023 | GitHub repo |
| S3: Increasing GPU Utilization during Generative Inference for Higher Throughput | | Harvard University | arXiv | |
| DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Decouple | PKU | OSDI 2024 | |
| Splitwise: Efficient generative LLM inference using phase splitting | Decouple | UW | ISCA 2024 | Track issue |
| FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Single GPU | Stanford University | arXiv | GitHub repo |
| Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | Decouple | GaTech | OSDI 2024 | |
| SpotServe: Serving Generative Large Language Models on Preemptible Instances | Preemptible GPU | CMU | ASPLOS 2024 | Empty GitHub repo |
| SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification | Tree-based speculative decoding | CMU | ASPLOS 2024 | |
| AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | Cache multi-turn prefill KV cache in host DRAM and SSD | NUS | ATC 2024 | |
| MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving | Spatial-temporal multiplexing to serve multiple LLMs | MMLab | arXiv | |
| PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | KV cache compression | Shanghai Jiao Tong University | arXiv | |
| You Only Cache Once: Decoder-Decoder Architectures for Language Models | KV cache | Microsoft Research | arXiv | |
| Better & Faster Large Language Models via Multi-token Prediction | Multi-token prediction | Meta | arXiv | |
| ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference | Decouple | Hanyang University | ASPLOS 2024 | |
| Parrot: Efficient Serving of LLM-based Applications with Semantic Variable | LLM applications | SJTU | OSDI 2024 | |
| Fairness in Serving Large Language Models | Fairness; LLM serving | UC Berkeley, Stanford University | OSDI 2024 | |
| Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | KV cache | Moonshot AI | GitHub | |
| MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention | Pre-filling for long context; Dynamic sparse attention | Microsoft | arXiv | GitHub repo |
| MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | Memory pool | Huawei | arXiv | |
| InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Sparsity | Seoul National University | OSDI 2024 | |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | Preemptible GPU | Alibaba Group | OSDI 2024 | |
| PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch | Multi-agent | Tsinghua University | ATC 2024 | |
| SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention | Sparsity; Long context | PKU | arXiv | |
| Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Sparsity; Related tokens | MIT | ICML 2024 | |
| Accelerating Production LLMs with Combined Token/Embedding Speculators | Speculative decoding | IBM Research | arXiv | GitHub repo |
| LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | KV cache | Apple | arXiv | |
| Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU | Attention saddles; KV cache | Shanghai Jiao Tong University | arXiv | |
| TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text | KV cache for RAG | Moore Threads AI | arXiv | GitHub repo |
| Efficient Streaming Language Models with Attention Sinks | StreamingLLM; Static sparsity | MIT | ICLR 2024 | GitHub repo |
| H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | Sparse attention | UT Austin | NeurIPS 2023 | |
| SparQ Attention: Bandwidth-Efficient LLM Inference | Sparse attention | Graphcore | ICML 2024 | |
| RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Vector retrieval | MSRA | arXiv | |
| CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | KV cache reuse across chunks | University of Chicago | EuroSys | |
| EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models | Position-independent caching | PKU | arXiv | |
| CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving | KV cache compression | University of Chicago | SIGCOMM | |
| SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation | Separate handling of prefill and decoding KV cache | SEU | arXiv 2024 | |
| FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Heterogeneous pipelines | THU | arXiv 2024 | |