EmbeddedLLM

EmbeddedLLM is an API server for embedded device deployment, currently supporting CUDA/OpenVINO/IpexLLM/DirectML/CPU. Run local LLMs on iGPU, APU and CPU from AMD, Intel, and (coming soon) Qualcomm. It is the easiest way to launch an OpenAI API compatible server on Windows, Linux and macOS.
| Support matrix | Supported now | Under Development | On the roadmap |
|---|---|---|---|
| Model architectures | Gemma<br/>Llama *<br/>Mistral +<br/>Phi | | |
| Platform | Linux<br/>Windows | | |
| Architecture | x86<br/>x64 | | Arm64 |
| Hardware Acceleration | CUDA<br/>DirectML<br/>IpexLLM | QNN<br/>ROCm | OpenVINO |
* The Llama model architecture supports similar model families such as CodeLlama, Vicuna, Yi, and more.
+ The Mistral model architecture supports similar model families such as Zephyr.
🚀 Latest News
- [2024/06] Support Phi-3 (mini, small, medium), Phi-3-Vision-Mini, Llama-2, Llama-3, Gemma (v1), Mistral v0.3, Starling-LM, Yi-1.5.
- [2024/06] Support vision/chat inference on iGPU, APU, CPU and CUDA.
Table of Contents
- Supported Models
- Onnxruntime Models
- Ipex-LLM Models
- Getting Started
- Installation From Source
- Launch OpenAI API Compatible Server
- Launch Chatbot Web UI
- Launch Model Management UI
- Compile OpenAI-API Compatible Server into Windows Executable
- Prebuilt Binary (Alpha)
- Acknowledgements
Supported Models (Quick Start)
| Models | Parameters | Context Length | Link |
|---|---|---|---|
| Gemma-2b-Instruct v1 | 2B | 8192 | EmbeddedLLM/gemma-2b-it-onnx |
| Llama-2-7b-chat | 7B | 4096 | EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml |
| Llama-2-13b-chat | 13B | 4096 | EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml |
| Llama-3-8b-chat | 8B | 8192 | EmbeddedLLM/mistral-7b-instruct-v0.3-onnx |
| Mistral-7b-v0.3-instruct | 7B | 32768 | EmbeddedLLM/mistral-7b-instruct-v0.3-onnx |
| Phi-3-mini-4k-instruct-062024 | 3.8B | 4096 | EmbeddedLLM/Phi-3-mini-4k-instruct-062024-onnx |
| Phi3-mini-4k-instruct | 3.8B | 4096 | microsoft/Phi-3-mini-4k-instruct-onnx |
| Phi3-mini-128k-instruct | 3.8B | 128k | microsoft/Phi-3-mini-128k-instruct-onnx |
| Phi3-medium-4k-instruct | 14B | 4096 | microsoft/Phi-3-medium-4k-instruct-onnx-directml |
| Phi3-medium-128k-instruct | 14B | 128k | microsoft/Phi-3-medium-128k-instruct-onnx-directml |
| Openchat-3.6-8b | 8B | 8192 | EmbeddedLLM/openchat-3.6-8b-20240522-onnx |
| Yi-1.5-6b-chat | 6B | 32k | EmbeddedLLM/01-ai_Yi-1.5-6B-Chat-onnx |
| Phi-3-vision-128k-instruct | 4.2B | 128k | EmbeddedLLM/Phi-3-vision-128k-instruct-onnx |
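The links above appear to be Hugging Face repository IDs. Below is a minimal sketch of downloading one of them with the `huggingface_hub` package so the local folder can be passed to `ellm_server --model_path`; the repo ID and target directory are examples, and some repositories ship per-backend subfolders, in which case `--model_path` should point at the matching subdirectory.

```python
# Minimal sketch: download an ONNX model from the table above so that the
# local folder can be passed to `ellm_server --model_path`.
# Assumes `pip install huggingface_hub`; the repo ID is taken from the table.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="EmbeddedLLM/Phi-3-mini-4k-instruct-062024-onnx",
    local_dir="./Phi-3-mini-4k-instruct-062024-onnx",
)
print("Model downloaded to:", local_dir)
```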
Getting Started
Installation
From Source
Windows

1. Custom Setup:
   - IPEX (XPU): requires an Anaconda environment: `conda create -n ellm python=3.10 libuv; conda activate ellm`.
   - DirectML: if you are using a conda environment, install the additional dependency: `conda install conda-forge::vs2015_runtime`.

2. Install the embeddedllm package, e.g. `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .`. Note: currently supports `cpu`, `directml` and `cuda`.
   - DirectML: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml]`
   - CPU: `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu]`
   - CUDA: `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda]`
   - IPEX: `$env:ELLM_TARGET_DEVICE='ipex'; python setup.py develop`
   - OpenVINO: `$env:ELLM_TARGET_DEVICE='openvino'; pip install -e .[openvino]`
   - With Web UI:
     - DirectML: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml,webui]`
     - CPU: `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu,webui]`
     - CUDA: `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda,webui]`
     - IPEX: `$env:ELLM_TARGET_DEVICE='ipex'; python setup.py develop; pip install -r requirements-webui.txt`
     - OpenVINO: `$env:ELLM_TARGET_DEVICE='openvino'; pip install -e .[openvino,webui]`
Linux

1. Custom Setup:
   - IPEX (XPU): requires an Anaconda environment: `conda create -n ellm python=3.10 libuv; conda activate ellm`.
   - DirectML: if you are using a conda environment, install the additional dependency: `conda install conda-forge::vs2015_runtime`.

2. Install the embeddedllm package, e.g. `ELLM_TARGET_DEVICE='directml' pip install -e .`. Note: currently supports `cpu`, `directml` and `cuda`.
   - DirectML: `ELLM_TARGET_DEVICE='directml' pip install -e .[directml]`
   - CPU: `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu]`
   - CUDA: `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda]`
   - IPEX: `ELLM_TARGET_DEVICE='ipex' python setup.py develop`
   - OpenVINO: `ELLM_TARGET_DEVICE='openvino' pip install -e .[openvino]`
   - With Web UI:
     - DirectML: `ELLM_TARGET_DEVICE='directml' pip install -e .[directml,webui]`
     - CPU: `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu,webui]`
     - CUDA: `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda,webui]`
     - IPEX: `ELLM_TARGET_DEVICE='ipex' python setup.py develop; pip install -r requirements-webui.txt`
     - OpenVINO: `ELLM_TARGET_DEVICE='openvino' pip install -e .[openvino,webui]`
Launch OpenAI API Compatible Server
1. Custom Setup:
   - IPEX
     - For Intel iGPU: `set SYCL_CACHE_PERSISTENT=1` and `set BIGDL_LLM_XMX_DISABLED=1`
     - For Intel Arc™ A-Series Graphics: `set SYCL_CACHE_PERSISTENT=1`
2. Run `ellm_server --model_path <path/to/model/weight>`.
3. Example code to connect to the API server can be found in `scripts/python`; see also the Python sketch below. Note: to see all supported arguments, run `ellm_server --help`.
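Because the server exposes an OpenAI-compatible API, any OpenAI client library can talk to it. Below is a minimal sketch using the official `openai` Python package; the port, API key placeholder and model name are assumptions and should be matched to the arguments you passed to `ellm_server`.

```python
# Minimal sketch: chat with a running ellm_server instance via the OpenAI client.
# Assumes the server was launched with `--port 5555`; the model name is a
# placeholder and should match your model / --served_model_name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5555/v1", api_key="ellm")  # key is not checked locally

response = client.chat.completions.create(
    model="phi3-mini-int4",  # hypothetical served model name
    messages=[{"role": "user", "content": "Explain what an APU is in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```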
Launch Chatbot Web UI
Run `ellm_chatbot --port 7788 --host localhost --server_port <ellm_server_port> --server_host localhost`. Note: to see all supported arguments, run `ellm_chatbot --help`.

Launch Model Management UI
The Model Management UI lets you download models and deploy an OpenAI API compatible server, and shows the disk space required to download each model.

Run `ellm_modelui --port 6678`. Note: to see all supported arguments, run `ellm_modelui --help`.

Compile OpenAI-API Compatible Server into Windows Executable
NOTE: OpenVINO packaging currently uses torch==2.4.0. The packaged executable will not run out of the box because of a missing dependency, libomp. Make sure to install libomp and add the libomp-xxxxxxx.dll to C:\Windows\System32.
1. Install `embeddedllm`.
2. Install PyInstaller: `pip install pyinstaller==6.9.0`.
3. Compile the Windows executable: `pyinstaller .\ellm_api_server.spec`.
4. You can find the executable in `dist\ellm_api_server`.
5. Use it like `ellm_server`: `.\ellm_api_server.exe --model_path <path/to/model/weight>`.

Powershell/Terminal Usage:

ellm_server --model_path <path/to/model/weight>

# DirectML
ellm_server --model_path 'EmbeddedLLM/Phi-3-mini-4k-instruct-onnx-directml' --port 5555

# IPEX-LLM
ellm_server --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'ipex' --device 'xpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'

# OpenVINO
ellm_server --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'openvino' --device 'gpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'
Prebuilt OpenAI API Compatible Windows Executable (Alpha)
You can find the prebuilt OpenAI API Compatible Windows Executable in the Release page.
Powershell/Terminal Usage (Use it like ellm_server):
.\ellm_api_server.exe --model_path <path/to/model/weight>
# DirectML
.\ellm_api_server.exe --model_path 'EmbeddedLLM_Phi-3-mini-4k-instruct-062024-onnx\onnx\directml\Phi-3-mini-4k-instruct-062024-int4' --port 5555
# IPEX-LLM
.\ellm_api_server.exe --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'ipex' --device 'xpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'
# OpenVINO
.\ellm_api_server.exe --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'openvino' --device 'gpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'
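The executable serves the same OpenAI-compatible endpoints as `ellm_server`, so it can also be exercised from Python. Below is a minimal streaming sketch; the port and served model name are taken from the example commands above and should be adjusted to your own launch arguments.

```python
# Minimal sketch: stream a chat completion from the prebuilt executable
# started above (assumes --port 5555 and the served model name used in the
# IPEX-LLM / OpenVINO examples).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5555/v1", api_key="ellm")

stream = client.chat.completions.create(
    model="meta-llama_Meta/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about on-device inference."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```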
Acknowledgements
- Excellent open-source projects: vLLM, onnxruntime-genai, Ipex-LLM and many others.