
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max).

[!IMPORTANT] bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.


💫 IPEX-LLM

IPEX-LLM is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency[^1].

[!NOTE]

  • It is built on top of Intel Extension for PyTorch (IPEX), as well as the excellent work of llama.cpp, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc.
  • It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc. (a minimal transformers-style example is sketched below).
  • 50+ models have been optimized/verified on ipex-llm (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
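
To illustrate the HuggingFace transformers integration above, here is a minimal INT4 inference sketch, assuming ipex-llm is installed with Intel GPU (XPU) support; the model id is illustrative and exact arguments may vary by release (on some releases, importing intel_extension_for_pytorch is required before moving the model to "xpu").

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"   # illustrative model id or local path

# load_in_4bit=True applies ipex-llm's INT4 optimization while loading the HF checkpoint
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to("xpu")   # move to Intel GPU; drop this line to stay on CPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Running the same sketch on CPU only requires dropping the .to("xpu") calls.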

ipex-llm Demo

See the demos of running Text-Generation-WebUI, local RAG using LangChain-Chatchat, llama.cpp and HuggingFace transformers (on either an Intel Core Ultra laptop or an Intel Arc GPU) with ipex-llm below.

[Demo videos on Intel Core Ultra laptop and Intel Arc GPU: Text-Generation-WebUI, Local RAG using LangChain-Chatchat, llama.cpp, HuggingFace transformers]

Latest Update 🔥

  • [2024/04] ipex-llm now provides C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
  • [2024/03] bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
  • [2024/02] ipex-llm now supports directly loading model from ModelScope (魔搭).
  • [2024/02] ipex-llm added initial INT2 support (based on llama.cpp IQ2 mechanism), which makes it possible to run large-size LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
  • [2024/02] Users can now use ipex-llm through Text-Generation-WebUI GUI.
  • [2024/02] ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 inference latency on Intel GPU and BF16 inference latency on Intel CPU.
  • [2024/02] ipex-llm now supports a comprehensive list of LLM finetuning on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
  • [2024/01] Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
More updates
  • [2023/12] ipex-llm now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
  • [2023/12] ipex-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
  • [2023/12] ipex-llm now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
  • [2023/12] ipex-llm now supports FP8 and FP4 inference on Intel GPU.
  • [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into ipex-llm is available.
  • [2023/11] ipex-llm now supports vLLM continuous batching on both Intel GPU and CPU.
  • [2023/10] ipex-llm now supports QLoRA finetuning on both Intel GPU and CPU.
  • [2023/10] ipex-llm now supports FastChat serving on both Intel CPU and GPU.
  • [2023/09] ipex-llm now supports Intel GPU (including iGPU, Arc, Flex and Max).
  • [2023/09] ipex-llm tutorial is released.

[^1]: Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.

ipex-llm Quickstart

Install ipex-llm

  • Windows GPU: installing ipex-llm on Windows with Intel GPU
  • Linux GPU: installing ipex-llm on Linux with Intel GPU
  • Docker: using ipex-llm dockers on Intel CPU and GPU
  • For more details, please refer to the installation guide

Run ipex-llm

  • llama.cpp: running llama.cpp (using C++ interface of ipex-llm as an accelerated backend for llama.cpp) on Intel GPU
  • ollama: running ollama (using C++ interface of ipex-llm as an accelerated backend for ollama) on Intel GPU
  • vLLM: running ipex-llm in vLLM on both Intel GPU and CPU
  • FastChat: running ipex-llm in FastChat serving on both Intel GPU and CPU
  • LangChain-Chatchat RAG: running ipex-llm in LangChain-Chatchat (Knowledge Base QA using RAG pipeline)
  • Text-Generation-WebUI: running ipex-llm in oobabooga WebUI
  • Benchmarking: running (latency and throughput) benchmarks for ipex-llm on Intel CPU and GPU

Code Examples

  • Low-bit inference (see the low-bit sketch after this list)
    • INT4 inference: INT4 LLM inference on Intel GPU and CPU
    • FP8/FP4 inference: FP8 and FP4 LLM inference on Intel GPU
    • INT8 inference: INT8 LLM inference on Intel GPU and CPU
    • INT2 inference: INT2 LLM inference (based on llama.cpp IQ2 mechanism) on Intel GPU
  • FP16/BF16 inference
  • Save and load (also covered in the low-bit sketch after this list)
    • Low-bit models: saving and loading ipex-llm low-bit models
    • GGUF: directly loading GGUF models into ipex-llm
    • AWQ: directly loading AWQ models into ipex-llm
    • GPTQ: directly loading GPTQ models into ipex-llm
  • Finetuning (see the QLoRA sketch after this list)
    • LLM finetuning on Intel GPU, including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA
    • QLoRA finetuning on Intel CPU
  • Integration with community libraries
    • HuggingFace transformers
    • Standard PyTorch model (see the optimize_model sketch after this list)
    • DeepSpeed-AutoTP
    • HuggingFace PEFT
    • HuggingFace TRL
    • LangChain
    • LlamaIndex
    • AutoGen
    • ModelScope
  • Tutorials
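
The low-bit inference and save-and-load entries above share one API surface: the load_in_low_bit argument of from_pretrained plus the save_low_bit/load_low_bit helpers. A minimal sketch follows, assuming an FP8-capable build and a hypothetical local path; the exact set of precision strings depends on the ipex-llm release.

```python
from ipex_llm.transformers import AutoModelForCausalLM

# load_in_low_bit selects the quantization format; values such as "sym_int4",
# "sym_int8", "fp4" and "fp8" are documented, but the exact set depends on the release
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_low_bit="fp8",
                                             trust_remote_code=True)

# persist the already-quantized weights so later runs skip the conversion step
saved_dir = "./llama-2-7b-ipex-llm-fp8"   # hypothetical local path
model.save_low_bit(saved_dir)

# reload the low-bit checkpoint directly, without touching the original FP16 weights
model = AutoModelForCausalLM.load_low_bit(saved_dir, trust_remote_code=True)
```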
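
For the finetuning entries, the sketch below outlines a QLoRA-style setup on Intel GPU. The module path ipex_llm.transformers.qlora and its helpers mirror the published ipex-llm finetuning examples, but treat the exact names, arguments and LoRA hyperparameters here as assumptions that may differ across releases.

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM
# qlora helpers follow the ipex-llm finetuning examples; treat the exact path as an assumption
from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training
from peft import LoraConfig

# load the base model in 4-bit NF4, the usual precision for QLoRA-style finetuning
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # illustrative model id
    load_in_low_bit="nf4",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = model.to("xpu")                   # Intel GPU; QLoRA on CPU is also supported

model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],   # illustrative LoRA hyperparameters
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# from here, train with a standard transformers Trainer / TRL SFTTrainer loop
```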
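
For the "Standard PyTorch model" entry, ipex-llm also exposes a one-call optimize_model API that applies low-bit optimizations to a model loaded through plain HuggingFace transformers (or another compatible PyTorch model). A minimal sketch, assuming the default INT4 precision:

```python
import torch
from transformers import AutoModelForCausalLM   # plain HuggingFace class, no ipex-llm subclass
from ipex_llm import optimize_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             torch_dtype=torch.float16,
                                             trust_remote_code=True)

# optimize_model applies ipex-llm's low-bit kernels to an existing PyTorch model;
# INT4 is the default, and other precisions can be requested via the low_bit argument
model = optimize_model(model)
model = model.to("xpu")   # or keep the optimized model on CPU
```

This path is useful when a model is constructed outside the AutoModel classes, since the optimization is applied to the instantiated module rather than at load time.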

For more details, please refer to the ipex-llm document website.

Verified Models

Over 50 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.

Model CPU Example GPU Example
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) link1, link2 link
LLaMA 2 link1, link2 link
ChatGLM link
ChatGLM2 link link
ChatGLM3 link link
Mistral link link
Mixtral link link
Falcon link link
MPT link link
Dolly-v1 link link
Dolly-v2 link link
Replit Code link link
RedPajama link1, link2
Phoenix link1, link2
StarCoder link1, link2 link
Baichuan link link
Baichuan2 link link
InternLM link link
Qwen link link
Qwen1.5 link link
Qwen-VL link link
Aquila link link
Aquila2 link link
MOSS link
Whisper link link
Phi-1_5 link link
Flan-t5 link link
LLaVA link link
CodeLlama link link
Skywork link
InternLM-XComposer link
WizardCoder-Python link
CodeShell link
Fuyu link
Distil-Whisper link link
Yi link link
BlueLM link link
Mamba link link
SOLAR link link
Phixtral link link
InternLM2 link link
RWKV4 link
RWKV5 link
Bark link link
SpeechT5 link
DeepSeek-MoE link
Ziya-Coding-34B-v1.0 link
Phi-2 link link
Yuan2 link link
Gemma link link
DeciLM-7B link link
Deepseek link link
StableLM link link

Get Support