
# Comparison of Language Model Inference Engines

## Open Source LLM Inference Engines

Overview of popular open source large language model inference engines. An inference engine is the program that loads a model's weights and generates text responses based on given inputs.
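
As a rough illustration of that load-then-generate loop (and not the code path of any engine in this comparison), the sketch below uses the Hugging Face `transformers` reference implementation, which is roughly what these engines aim to outperform; the model name is only an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A minimal "inference engine": load weights, tokenize a prompt, generate text.
model_name = "meta-llama/Llama-2-7b-chat-hf"   # example model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain continuous batching in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)   # autoregressive decode loop
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The engines compared below replace this naive loop with batched scheduling, paged KV caches, fused attention kernels, and quantization to serve many requests at once.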

Feel free to create a PR or issue if you want a new engine column, a new feature row, or a status update.

## Compared Inference Engines

- vLLM: Designed to provide SOTA throughput.
- TensorRT-LLM: Nvidia's high-performance, extensible, PyTorch-like API, designed for use with the Nvidia Triton Inference Server.
- llama.cpp: Pure C++ without any dependencies, with Apple Silicon prioritized.
- TGI: HuggingFace's fast and flexible engine designed for high throughput.
- LightLLM: Lightweight, fast, and flexible framework targeting performance, written purely in Python / Triton.
- DeepSpeed-MII / DeepSpeed-FastGen: Microsoft's high-performance implementation, including the SOTA Dynamic SplitFuse scheduler.
- ExLlamaV2: Efficiently runs language models on modern consumer GPUs. Implements the SOTA quantization method EXL2.

## Comparison Table

✅ Included | 🟠 Inferior Alternative | đŸŒŠī¸ Exists but has Issues | 🔨 PR | đŸ—“ī¸ Planned | ❓ Unclear / Unofficial | ❌ Not Implemented

|                        | vLLM | TensorRT-LLM | llama.cpp | TGI | LightLLM | Fastgen | ExLlamaV2 |
| ---------------------- | ---- | ------------ | --------- | --- | -------- | ------- | --------- |
| **Optimizations**      |      |              |           |     |          |         |           |
| FlashAttention2        | ✅ [^4] | ✅ [^16] | 🟠 [^43] | ✅ [^1] | ✅ | ✅ | ✅ |
| PagedAttention         | ✅ [^1] | ✅ [^16] | ❌ [^10] | ✅ | 🟠*** [^19] | ✅ | ✅ [^47] |
| Speculative Decoding   | 🔨 [^8] | ✅ [^2] | ✅ [^11] | ✅ [^3] | ❌ | ❌ [^27] | ✅ |
| Tensor Parallel        | ✅ | ✅ [^17] | 🟠** [^12] | ✅ [^5] | ✅ | ✅ [^25] | ❌ |
| Pipeline Parallel      | ✅ [^36] | ✅ [^45] | ❌ [^46] | ❓ [^5] | ❌ | ❌ [^26] | ❌ |
| **Optim. / Scheduler** |      |              |           |     |          |         |           |
| Dyn. SplitFuse (SOTA[^22]) | đŸ—“ī¸ [^22] | đŸ—“ī¸ [^29] | ❌ | ❌ | ❌ | ✅ [^22] | ❌ |
| Efficient Rtr (better) | ❌ | ❌ | ❌ | ❌ | ✅ [^24] | ❌ | ❌ |
| Cont. Batching         | ✅ [^22] | ✅ [^23] | ✅ | ✅ | ❌ | ✅ [^25] | ❓ [^37] |
| **Optim. / Quant**     |      |              |           |     |          |         |           |
| EXL2 (SOTA[^35])       | 🔨 [^34] | ❌ | ❌ | ✅ [^40] | ❌ | ❌ | ✅ |
| AWQ                    | đŸŒŠī¸ [^39] | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Other Quants           | (yes) [^30] | GPTQ | GGUF [^31] | (yes) [^18] | ? | ? | ? |
| **Features**           |      |              |           |     |          |         |           |
| OpenAI-Style API       | ✅ | ❌ [^42] | ✅ [^13] | ✅ [^44] | ✅ [^20] | ❌ | ❌ |
| **Feat. / Sampling**   |      |              |           |     |          |         |           |
| Beam Search            | ✅ | ✅ [^16] | ✅ [^14] | 🟠**** [^7] | ❌ | ❌ [^28] | ❌ [^38] |
| JSON / Grammars via Outlines [^41] | ✅ | đŸ—“ī¸ | ✅ | ✅ | ? | ? | ✅ |
| **Models**             |      |              |           |     |          |         |           |
| Llama 2 / 3            | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Mistral                | ✅ | ✅ | ✅ | ✅ | ✅ [^21] | ✅ | ✅ |
| Mixtral                | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Implementation**     |      |              |           |     |          |         |           |
| Core Language          | Python | C++ | C++ | Py / Rust | Python | Python | Python |
| GPU API                | CUDA* | CUDA* | Metal / CUDA | CUDA* | Triton / CUDA | CUDA* | CUDA |
| **Repo**               |      |              |           |     |          |         |           |
| License                | Apache 2 | Apache 2 | MIT | Apache 2 [^15] | Apache 2 | Apache 2 | MIT |
| GitHub Stars           | 17K | 6K | 54K | 8K | 2K | 2K | 3K |
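
Several of the engines above expose an OpenAI-compatible HTTP server (see the OpenAI-Style API row), so the standard `openai` Python client can usually be pointed at a locally running instance. A minimal sketch, assuming a server is already listening on port 8000 (vLLM's default) and that the model name matches whatever the server loaded; both values are illustrative and differ per engine.

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running, OpenAI-compatible server.
# Base URL, API key handling, and model name are engine-specific; these are examples.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # whatever model the server loaded
    messages=[{"role": "user", "content": "Summarize paged attention in one sentence."}],
    max_tokens=64,
    temperature=0.7,
)
print(response.choices[0].message.content)
```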

## Benchmarks

## Notes

*Supports Triton for one-off kernels such as FlashAttention (FusedAttention) / quantization, or allows Triton plugins; however, the project does not use Triton otherwise.

**Sequentially processed tensor split

***"TokenAttention is the special case of PagedAttention when block size equals to 1, which we have tested before and find it under-utilizes GPU compute compared to larger block size. Unless LightLLM's Triton kernel implementation is surprisingly fast, this should not bring speedup."

****TGI maintainers suggest using best_of instead of beam search (best_of creates n independent generations and selects the one with the highest overall logprob). Anecdotally, beam search is much better at finding the best generation for "non-creative" tasks.
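
As a toy illustration of that selection rule (not TGI's implementation), a best_of-style pick samples n completions and keeps the one with the highest total, or length-normalized, token log-probability; the candidate data below is made up.

```python
# Hypothetical candidates: each is a list of (token, logprob) pairs,
# e.g. as reported by an engine that returns per-token logprobs.
candidates = [
    [("The", -0.2), ("answer", -0.9), ("is", -0.1), ("4", -0.3)],
    [("The", -0.2), ("answer", -0.9), ("is", -0.1), ("5", -2.6)],
    [("It", -1.4), ("is", -0.5), ("4", -0.4)],
]

def best_of(cands, length_normalize=True):
    """Pick the candidate with the highest total (or mean) log-probability.
    Whether to length-normalize is a design choice; it avoids favoring short outputs."""
    def score(c):
        total = sum(lp for _, lp in c)
        return total / len(c) if length_normalize else total
    return max(cands, key=score)

best = best_of(candidates)
print(" ".join(tok for tok, _ in best))

# Beam search, by contrast, prunes to the top-k partial hypotheses at every decoding
# step rather than sampling full completions independently and ranking them afterwards.
```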

[^15]: https://raw.githubusercontent.com/huggingface/text-generation-inference/main/LICENSE, https://twitter.com/julien_c/status/1777328456709062848
[^16]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_attention.md
[^17]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp#L184
[^18]: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/cli.py#L15-L21
[^19]: https://github.com/ModelTC/lightllm/blob/main/docs/TokenAttention.md
[^20]: https://github.com/ModelTC/lightllm/blob/main/lightllm/server/api_models.py#L9
[^21]: https://github.com/ModelTC/lightllm/issues/224#issuecomment-1827365514
[^22]: https://blog.vllm.ai/2023/11/14/notes-vllm-vs-deepspeed.html, https://github.com/vllm-project/vllm/issues/1562
[^23]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md
[^24]: https://github.com/ModelTC/lightllm/blob/a9cf0152ad84beb663cddaf93a784092a47d1515/docs/LightLLM.md#efficient-router
[^25]: https://github.com/microsoft/DeepSpeed-MII
[^26]: https://github.com/microsoft/DeepSpeed-MII/issues/329#issuecomment-1830317364
[^27]: https://github.com/microsoft/DeepSpeed-MII/issues/254
[^28]: https://github.com/microsoft/DeepSpeed-MII/issues/286#issuecomment-1808510043
[^29]: https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752
[^30]: https://github.com/vllm-project/vllm/blob/1f24755bf802a2061bd46f3dd1191b7898f13f45/vllm/model_executor/quantization_utils/squeezellm.py#L8
[^31]: https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md
[^34]: https://github.com/vllm-project/vllm/issues/296
[^35]: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/#pareto-frontiers
[^36]: https://github.com/vllm-project/vllm/issues/387
[^37]: https://github.com/turboderp/exllamav2/discussions/19#discussioncomment-6989460
[^38]: https://github.com/turboderp/exllamav2/issues/84
[^39]: https://github.com/vllm-project/vllm/blob/main/docs/source/quantization/auto_awq.rst
[^40]: https://github.com/huggingface/text-generation-inference/pull/1211
[^41]: Via https://github.com/outlines-dev/outlines
[^42]: https://github.com/NVIDIA/TensorRT-LLM/issues/334
[^42]: https://github.com/ggerganov/llama.cpp/blob/master/examples/json-schema-to-grammar.py
[^43]: https://github.com/ggerganov/llama.cpp/pull/5021 (FlashAttention, but not FlashAttention2)
[^44]: https://huggingface.co/docs/text-generation-inference/messages_api
[^45]: https://github.com/NVIDIA/TensorRT-LLM/blob/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/tensorrt_llm/auto_parallel/config.py#L35
[^46]: "without specific architecture tricks, you will only be using one GPU at a time, and your performance will suffer compared to a single GPU due to communication and synchronization overhead." https://github.com/ggerganov/llama.cpp/issues/4238#issuecomment-1832768597
[^47]: https://github.com/turboderp/exllamav2/commit/affc3508c1d18e4294a5062f794f44112a8b07c5