lm-inference-engines
lm-inference-engines copied to clipboard
Comparison of Language Model Inference Engines
Open Source LLM Inference Engines
Overview of popular open source large language model inference engines. An inference engine is the program which loads a models weights and generates text responses based on given inputs.
Feel free to create a PR or issue if you want a new engine column, feature row, or update a status.
Compared Inference Engines
- vLLM: Designed to provide SOTA throughput.
- TensorRT-LLM: Nvidias design for a high performance extensible pytorch-like API for use with Nvidia Triton Inference Server.
- llama.cpp: Pure C++ without any dependencies, with Apple Silicon prioritized.
- TGI: HuggingFace' fast and flexible engine designed for high throughput.
- LightLLM: Lightweight, fast and flexible framework targeting performance, written purely in Python / Triton.
- DeepSpeed-MII / DeepSpeed-FastGen: Microsofts high performance implementation including SOTA Dynamic Splitfuse
- ExLlamaV2: Efficiently run language models on modern consumer GPUs. Implements SOTA quantization method, EXL2.
Comparison Table
â Included | đ Inferior Alternative | đŠī¸ Exists but has Issues | đ¨ PR | đī¸ Planned |â Unclear / Unofficial | â Not Implemented
| vLLM | TensorRT-LLM | llama.cpp | TGI | LightLLM | Fastgen | ExLlamaV2 | |
|---|---|---|---|---|---|---|---|
| Optimizations | |||||||
| FlashAttention2 | â ^4 | â [^16] | đ [^43] | â ^1 | â | â | â |
| PagedAttention | â ^1 | â [^16] | â ^10 | â | đ *** [^19] | â | â [^47] |
| Speculative Decoding | đ¨ ^8 | â ^2 | â ^11 | â ^3 | â | â [^27] | â |
| Tensor Parallel | â | â [^17] | đ ** ^12 | â ^5 | â | â [^25] | â |
| Pipeline Parallel | â [^36] | â [^45] | â [^46] | â ^5 | â | â [^26] | â |
| Optim. / Scheduler | |||||||
| Dyn. SplitFuse (SOTA[^22]) | đī¸ [^22] | đī¸ [^29] | â | â | â | â [^22] | â |
| Efficient Rtr (better) | â | â | â | â | â [^24] | â | â |
| Cont. Batching | â [^22] | â [^23] | â | â | â | â [^25] | â [^37] |
| Optim. / Quant | |||||||
| EXL2 (SOTA[^35]) | đ¨ [^34] | â | â | â [^40] | â | â | â |
| AWQ | đŠī¸ [^39] | â | â | â | â | â | â |
| Other Quants | (yes) [^30] | GPTQ | GGUF [^31] | (yes) [^18] | ? | ? | ? |
| Features | |||||||
| OpenAI-Style API | â | â [^42] | â [^13] | â [^44] | â [^20] | â | â |
| Feat. / Sampling | |||||||
| Beam Search | â | â [^16] | â ^14 | đ **** ^7 | â | â [^28] | â [^38] |
| JSON / Grammars via Outlines | â | đī¸ | â | â | ? | ? | â |
| Models | |||||||
| Llama 2 / 3 | â | â | â | â | â | â | â |
| Mistral | â | â | â | â | â [^21] | â | â |
| Mixtral | â | â | â | â | â | â | â |
| Implementation | |||||||
| Core Language | Python | C++ | C++ | Py / Rust | Python | Python | Python |
| GPU API | CUDA* | CUDA* | Metal / CUDA | CUDA* | Triton / CUDA | CUDA* | CUDA |
| Repo | |||||||
| License | Apache 2 | Apache 2 | MIT | Apache 2 [^15] | Apache 2 | Apache 2 | MIT |
| Github Stars | 17K | 6K | 54K | 8K | 2K | 2K | 3K |
Benchmarks
- BentoML (June 5th, 2024): Compares LMDeploy, MLC-LLM, TGI, TRT-LLM, vLLM
Notes
*Supports Triton for one-off such as FlashAttention (FusedAttention) / quantization, or allows Triton plugins, however the project doesn't use Triton otherwise.
**Sequentially processed tensor split
****TGI maintainers suggest using best_of instead of beam search. (best_of creates n generations and selects the one with the lowest logprob). Anecdotally, beam search is much better at finding the best generation for "non-creative" tasks.
[^15]: https://raw.githubusercontent.com/huggingface/text-generation-inference/main/LICENSE, https://twitter.com/julien_c/status/1777328456709062848 [^16]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_attention.md [^17]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp#L184 [^18]: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/cli.py#L15-L21 [^19]: https://github.com/ModelTC/lightllm/blob/main/docs/TokenAttention.md [^20]: https://github.com/ModelTC/lightllm/blob/main/lightllm/server/api_models.py#L9 [^21]: https://github.com/ModelTC/lightllm/issues/224#issuecomment-1827365514 [^22]: https://blog.vllm.ai/2023/11/14/notes-vllm-vs-deepspeed.html, https://github.com/vllm-project/vllm/issues/1562 [^23]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md [^24]: https://github.com/ModelTC/lightllm/blob/a9cf0152ad84beb663cddaf93a784092a47d1515/docs/LightLLM.md#efficient-router [^25]: https://github.com/microsoft/DeepSpeed-MII [^26]: https://github.com/microsoft/DeepSpeed-MII/issues/329#issuecomment-1830317364 [^27]: https://github.com/microsoft/DeepSpeed-MII/issues/254 [^28]: https://github.com/microsoft/DeepSpeed-MII/issues/286#issuecomment-1808510043 [^29]: https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752 [^30]: https://github.com/vllm-project/vllm/blob/1f24755bf802a2061bd46f3dd1191b7898f13f45/vllm/model_executor/quantization_utils/squeezellm.py#L8 [^31]: https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md [^34]: https://github.com/vllm-project/vllm/issues/296 [^35]: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/#pareto-frontiers [^36]: https://github.com/vllm-project/vllm/issues/387 [^37]: https://github.com/turboderp/exllamav2/discussions/19#discussioncomment-6989460 [^38]: https://github.com/turboderp/exllamav2/issues/84 [^39]: https://github.com/vllm-project/vllm/blob/main/docs/source/quantization/auto_awq.rst [^40]: https://github.com/huggingface/text-generation-inference/pull/1211 [^41]: Via https://github.com/outlines-dev/outlines [^42]: https://github.com/NVIDIA/TensorRT-LLM/issues/334 [^42]: https://github.com/ggerganov/llama.cpp/blob/master/examples/json-schema-to-grammar.py [^43]: https://github.com/ggerganov/llama.cpp/pull/5021 FlashAttention, but not FlashAttention2 [^44]: https://huggingface.co/docs/text-generation-inference/messages_api [^45]: https://github.com/NVIDIA/TensorRT-LLM/blob/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/tensorrt_llm/auto_parallel/config.py#L35 [^46]: "without specific architecture tricks, you will only be using one GPU at a time, and your performance will suffer compared to a single GPU due to communication and synchronization overhead." https://github.com/ggerganov/llama.cpp/issues/4238#issuecomment-1832768597 [^47]: https://github.com/turboderp/exllamav2/commit/affc3508c1d18e4294a5062f794f44112a8b07c5