Open Source LLM Inference Engines

Overview of popular open source large language model inference engines. An inference engine is the program which loads a models weights and generates text responses based on given inputs.

Feel free to create a PR or issue if you want a new engine column, feature row, or update a status.

Compared Inference Engines

vLLM: Designed to provide SOTA throughput.
TensorRT-LLM: Nvidias design for a high performance extensible pytorch-like API for use with Nvidia Triton Inference Server.
llama.cpp: Pure C++ without any dependencies, with Apple Silicon prioritized.
TGI: HuggingFace' fast and flexible engine designed for high throughput.
LightLLM: Lightweight, fast and flexible framework targeting performance, written purely in Python / Triton.
DeepSpeed-MII / DeepSpeed-FastGen: Microsofts high performance implementation including SOTA Dynamic Splitfuse
ExLlamaV2: Efficiently run language models on modern consumer GPUs. Implements SOTA quantization method, EXL2.

Comparison Table

	vLLM	TensorRT-LLM	llama.cpp	TGI	LightLLM	Fastgen	ExLlamaV2
Optimizations
FlashAttention2	✅ ^4	✅ [^16]	🟠 [^43]	✅ ^1	✅	✅	✅
PagedAttention	✅ ^1	✅ [^16]	❌ ^10	✅	🟠*** [^19]	✅	✅ [^47]
Speculative Decoding	🔨 ^8	✅ ^2	✅ ^11	✅ ^3	❌	❌ [^27]	✅
Tensor Parallel	✅	✅ [^17]	🟠** ^12	✅ ^5	✅	✅ [^25]	❌
Pipeline Parallel	✅ [^36]	✅ [^45]	❌ [^46]	❓ ^5	❌	❌ [^26]	❌
Optim. / Scheduler
Dyn. SplitFuse (SOTA[^22])	🗓️ [^22]	🗓️ [^29]	❌	❌	❌	✅ [^22]	❌
Efficient Rtr (better)	❌	❌	❌	❌	✅ [^24]	❌	❌
Cont. Batching	✅ [^22]	✅ [^23]	✅	✅	❌	✅ [^25]	❓ [^37]
Optim. / Quant
EXL2 (SOTA[^35])	🔨 [^34]	❌	❌	✅ [^40]	❌	❌	✅
AWQ	🌩️ [^39]	✅	❌	✅	❌	❌	❌
Other Quants	(yes) [^30]	GPTQ	GGUF [^31]	(yes) [^18]	?	?	?
Features
OpenAI-Style API	✅	❌ [^42]	✅ [^13]	✅ [^44]	✅ [^20]	❌	❌
Feat. / Sampling
Beam Search	✅	✅ [^16]	✅ ^14	🟠**** ^7	❌	❌ [^28]	❌ [^38]
JSON / Grammars via Outlines	✅	🗓️	✅	✅	?	?	✅
Models
Llama 2 / 3	✅	✅	✅	✅	✅	✅	✅
Mistral	✅	✅	✅	✅	✅ [^21]	✅	✅
Mixtral	✅	✅	✅	✅	✅	✅	✅
Implementation
Core Language	Python	C++	C++	Py / Rust	Python	Python	Python
GPU API	CUDA*	CUDA*	Metal / CUDA	CUDA*	Triton / CUDA	CUDA*	CUDA
Repo
License	Apache 2	Apache 2	MIT	Apache 2 [^15]	Apache 2	Apache 2	MIT
Github Stars	17K	6K	54K	8K	2K	2K	3K

Benchmarks

BentoML (June 5th, 2024): Compares LMDeploy, MLC-LLM, TGI, TRT-LLM, vLLM

Notes

*Supports Triton for one-off such as FlashAttention (FusedAttention) / quantization, or allows Triton plugins, however the project doesn't use Triton otherwise.

**Sequentially processed tensor split

***"TokenAttention is the special case of PagedAttention when block size equals to 1, which we have tested before and find it under-utilizes GPU compute compared to larger block size. Unless LightLLM's Triton kernel implementation is surprisingly fast, this should not bring speedup."

****TGI maintainers suggest using best_of instead of beam search. (best_of creates n generations and selects the one with the lowest logprob). Anecdotally, beam search is much better at finding the best generation for "non-creative" tasks.

[^15]: https://raw.githubusercontent.com/huggingface/text-generation-inference/main/LICENSE, https://twitter.com/julien_c/status/1777328456709062848 [^16]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_attention.md [^17]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp#L184 [^18]: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/cli.py#L15-L21 [^19]: https://github.com/ModelTC/lightllm/blob/main/docs/TokenAttention.md [^20]: https://github.com/ModelTC/lightllm/blob/main/lightllm/server/api_models.py#L9 [^21]: https://github.com/ModelTC/lightllm/issues/224#issuecomment-1827365514 [^22]: https://blog.vllm.ai/2023/11/14/notes-vllm-vs-deepspeed.html, https://github.com/vllm-project/vllm/issues/1562 [^23]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md [^24]: https://github.com/ModelTC/lightllm/blob/a9cf0152ad84beb663cddaf93a784092a47d1515/docs/LightLLM.md#efficient-router [^25]: https://github.com/microsoft/DeepSpeed-MII [^26]: https://github.com/microsoft/DeepSpeed-MII/issues/329#issuecomment-1830317364 [^27]: https://github.com/microsoft/DeepSpeed-MII/issues/254 [^28]: https://github.com/microsoft/DeepSpeed-MII/issues/286#issuecomment-1808510043 [^29]: https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752 [^30]: https://github.com/vllm-project/vllm/blob/1f24755bf802a2061bd46f3dd1191b7898f13f45/vllm/model_executor/quantization_utils/squeezellm.py#L8 [^31]: https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md [^34]: https://github.com/vllm-project/vllm/issues/296 [^35]: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/#pareto-frontiers [^36]: https://github.com/vllm-project/vllm/issues/387 [^37]: https://github.com/turboderp/exllamav2/discussions/19#discussioncomment-6989460 [^38]: https://github.com/turboderp/exllamav2/issues/84 [^39]: https://github.com/vllm-project/vllm/blob/main/docs/source/quantization/auto_awq.rst [^40]: https://github.com/huggingface/text-generation-inference/pull/1211 [^41]: Via https://github.com/outlines-dev/outlines [^42]: https://github.com/NVIDIA/TensorRT-LLM/issues/334 [^42]: https://github.com/ggerganov/llama.cpp/blob/master/examples/json-schema-to-grammar.py [^43]: https://github.com/ggerganov/llama.cpp/pull/5021 FlashAttention, but not FlashAttention2 [^44]: https://huggingface.co/docs/text-generation-inference/messages_api [^45]: https://github.com/NVIDIA/TensorRT-LLM/blob/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/tensorrt_llm/auto_parallel/config.py#L35 [^46]: "without specific architecture tricks, you will only be using one GPU at a time, and your performance will suffer compared to a single GPU due to communication and synchronization overhead." https://github.com/ggerganov/llama.cpp/issues/4238#issuecomment-1832768597 [^47]: https://github.com/turboderp/exllamav2/commit/affc3508c1d18e4294a5062f794f44112a8b07c5

lm-inference-engines
lm-inference-engines copied to clipboard

Metadata

Open Source LLM Inference Engines

Compared Inference Engines

Comparison Table

Benchmarks

Notes

← Metadata

Owner

Metadata

lm-inference-engines lm-inference-engines copied to clipboard

Metadata

Open Source LLM Inference Engines

Compared Inference Engines

Comparison Table

Benchmarks

Notes

← Metadata

Owner

Metadata

lm-inference-engines
lm-inference-engines copied to clipboard