[!IMPORTANT]
`bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
💫 IPEX-LLM
IPEX-LLM is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency[^1].
[!NOTE]
- It is built on top of Intel Extension for PyTorch (`IPEX`), as well as the excellent work of `llama.cpp`, `bitsandbytes`, `vLLM`, `qlora`, `AutoGPTQ`, `AutoAWQ`, etc.
- It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc. (a minimal usage sketch follows this note).
- 50+ models have been optimized/verified on `ipex-llm` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
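As a quick illustration of the HuggingFace transformers integration mentioned above, here is a minimal sketch of loading a model with on-the-fly INT4 quantization and running generation. The model id, prompt and device placement are illustrative placeholders, not a verified script; see the official examples linked below for tested code.

```python
# Minimal sketch: load a HuggingFace model through ipex-llm's drop-in
# AutoModelForCausalLM and generate a short completion.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement

model_path = "meta-llama/Llama-2-7b-chat-hf"  # example model id (placeholder)

# load_in_4bit=True applies low-bit (INT4) quantization while loading
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# model = model.to("xpu")  # uncomment to run on an Intel GPU

prompt = "What is AI?"
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # input_ids = input_ids.to("xpu")  # keep inputs on the same device as the model
    output = model.generate(input_ids, max_new_tokens=32)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```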
ipex-llm Demo
See the demo of running Text-Generation-WebUI, local RAG using LangChain-Chatchat, llama.cpp and HuggingFace transformers (on either Intel Core Ultra laptop or Arc GPU) with ipex-llm below.
| Intel Core Ultra Laptop | Intel Core Ultra Laptop | Intel Arc GPU | Intel Arc GPU |
|---|---|---|---|
| Text-Generation-WebUI | Local RAG using LangChain-Chatchat | llama.cpp | HuggingFace transformers |
Latest Update 🔥
- [2024/04] `ipex-llm` now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
- [2024/03] `bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
- [2024/02] `ipex-llm` now supports directly loading models from ModelScope (魔搭).
- [2024/02] `ipex-llm` added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on Intel GPUs with 16GB VRAM.
- [2024/02] Users can now use `ipex-llm` through the Text-Generation-WebUI GUI.
- [2024/02] `ipex-llm` now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
- [2024/02] `ipex-llm` now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- [2024/01] Using `ipex-llm` QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
More updates
- [2023/12] `ipex-llm` now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
- [2023/12] `ipex-llm` now supports Mixtral-8x7B on both Intel GPU and CPU.
- [2023/12] `ipex-llm` now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
- [2023/12] `ipex-llm` now supports FP8 and FP4 inference on Intel GPU.
- [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into `ipex-llm` is available.
- [2023/11] `ipex-llm` now supports vLLM continuous batching on both Intel GPU and CPU.
- [2023/10] `ipex-llm` now supports QLoRA finetuning on both Intel GPU and CPU.
- [2023/10] `ipex-llm` now supports FastChat serving on both Intel CPU and GPU.
- [2023/09] `ipex-llm` now supports Intel GPU (including iGPU, Arc, Flex and Max).
- [2023/09] `ipex-llm` tutorial is released.
[^1]: Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.
ipex-llm Quickstart
Install ipex-llm
- Windows GPU: installing `ipex-llm` on Windows with Intel GPU
- Linux GPU: installing `ipex-llm` on Linux with Intel GPU
- Docker: using `ipex-llm` dockers on Intel CPU and GPU
- For more details, please refer to the installation guide
Run ipex-llm
- llama.cpp: running llama.cpp (using the C++ interface of `ipex-llm` as an accelerated backend for `llama.cpp`) on Intel GPU
- ollama: running ollama (using the C++ interface of `ipex-llm` as an accelerated backend for `ollama`) on Intel GPU
- vLLM: running `ipex-llm` in `vLLM` on both Intel GPU and CPU
- FastChat: running `ipex-llm` in `FastChat` serving on both Intel GPU and CPU
- LangChain-Chatchat RAG: running `ipex-llm` in `LangChain-Chatchat` (Knowledge Base QA using a RAG pipeline)
- Text-Generation-WebUI: running `ipex-llm` in `oobabooga` WebUI
- Benchmarking: running (latency and throughput) benchmarks for `ipex-llm` on Intel CPU and GPU
Code Examples
- Low-bit inference
  - INT4 inference: INT4 LLM inference on Intel GPU and CPU
  - FP8/FP4 inference: FP8 and FP4 LLM inference on Intel GPU
  - INT8 inference: INT8 LLM inference on Intel GPU and CPU
  - INT2 inference: INT2 LLM inference (based on the llama.cpp IQ2 mechanism) on Intel GPU
- FP16/BF16 inference
  - FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
  - BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
- Save and load
  - Low-bit models: saving and loading `ipex-llm` low-bit models (see the sketch after this list)
  - GGUF: directly loading GGUF models into `ipex-llm`
  - AWQ: directly loading AWQ models into `ipex-llm`
  - GPTQ: directly loading GPTQ models into `ipex-llm`
- Finetuning
  - LLM finetuning on Intel GPU, including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA
  - QLoRA finetuning on Intel CPU
- Integration with community libraries
  - HuggingFace transformers
  - Standard PyTorch model (see the sketch below)
  - DeepSpeed-AutoTP
  - HuggingFace PEFT
  - HuggingFace TRL
  - LangChain
  - LlamaIndex
  - AutoGen
  - ModelScope
- Tutorials
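To make the save-and-load flow above concrete, here is a short sketch: quantize a model once, persist the low-bit weights, and reload them later. It assumes the `save_low_bit`/`load_low_bit` helpers covered by the save-and-load examples; the model id and paths are placeholders, so refer to the linked examples for verified scripts.

```python
# Sketch: save a low-bit model once, then reload it without re-quantizing.
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"   # example model id (placeholder)
low_bit_path = "./llama2-7b-sym-int4"          # example output directory (placeholder)

# Quantize once (here to INT4; other options such as "fp8" follow the
# low-bit inference examples above) and persist the low-bit weights.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_low_bit="sym_int4",
                                             trust_remote_code=True)
model.save_low_bit(low_bit_path)
AutoTokenizer.from_pretrained(model_path,
                              trust_remote_code=True).save_pretrained(low_bit_path)

# Later runs can skip the FP16 download/conversion and load the low-bit copy directly.
model = AutoModelForCausalLM.load_low_bit(low_bit_path)
tokenizer = AutoTokenizer.from_pretrained(low_bit_path, trust_remote_code=True)
```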
For more details, please refer to the ipex-llm document website.
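For the "Standard PyTorch model" integration listed above, the general-purpose `optimize_model` entry point can be sketched as follows. This is a rough illustration, assuming a generic HuggingFace causal LM as the model being optimized; the model id is a placeholder.

```python
# Sketch: apply ipex-llm's generic low-bit optimization to an existing
# PyTorch model (a vanilla HuggingFace causal LM used purely as an example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model

model_path = "meta-llama/Llama-2-7b-chat-hf"  # example model id (placeholder)

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             torch_dtype="auto",
                                             low_cpu_mem_usage=True,
                                             trust_remote_code=True)
# optimize_model applies ipex-llm's low-bit weight optimization to the
# loaded module in place of the transformers-style loading shown earlier.
model = optimize_model(model)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
with torch.inference_mode():
    ids = tokenizer.encode("What is AI?", return_tensors="pt")
    print(tokenizer.decode(model.generate(ids, max_new_tokens=32)[0],
                           skip_special_tokens=True))
```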
Verified Models
Over 50 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
| Model | CPU Example | GPU Example |
|---|---|---|
| LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) | link1, link2 | link |
| LLaMA 2 | link1, link2 | link |
| ChatGLM | link | |
| ChatGLM2 | link | link |
| ChatGLM3 | link | link |
| Mistral | link | link |
| Mixtral | link | link |
| Falcon | link | link |
| MPT | link | link |
| Dolly-v1 | link | link |
| Dolly-v2 | link | link |
| Replit Code | link | link |
| RedPajama | link1, link2 | |
| Phoenix | link1, link2 | |
| StarCoder | link1, link2 | link |
| Baichuan | link | link |
| Baichuan2 | link | link |
| InternLM | link | link |
| Qwen | link | link |
| Qwen1.5 | link | link |
| Qwen-VL | link | link |
| Aquila | link | link |
| Aquila2 | link | link |
| MOSS | link | |
| Whisper | link | link |
| Phi-1_5 | link | link |
| Flan-t5 | link | link |
| LLaVA | link | link |
| CodeLlama | link | link |
| Skywork | link | |
| InternLM-XComposer | link | |
| WizardCoder-Python | link | |
| CodeShell | link | |
| Fuyu | link | |
| Distil-Whisper | link | link |
| Yi | link | link |
| BlueLM | link | link |
| Mamba | link | link |
| SOLAR | link | link |
| Phixtral | link | link |
| InternLM2 | link | link |
| RWKV4 | link | |
| RWKV5 | link | |
| Bark | link | link |
| SpeechT5 | link | |
| DeepSeek-MoE | link | |
| Ziya-Coding-34B-v1.0 | link | |
| Phi-2 | link | link |
| Yuan2 | link | link |
| Gemma | link | link |
| DeciLM-7B | link | link |
| Deepseek | link | link |
| StableLM | link | link |
Get Support
- Please report a bug or raise a feature request by opening a GitHub Issue
- Please report a vulnerability by opening a draft GitHub Security Advisory