Documentation: CLBlast support in Kubernetes, to enable AMD GPUs and Intel iGPUs
Spawned off of #404
This is a runbook for enabling CLBlast in Kubernetes; with a bit of work it can be applied to Docker as well. This enables AMD GPUs and Intel iGPUs.
The main steps that need to be done are:
- Enable GPU Passthrough
- Install OpenCL drivers and CLBlast
- Configure GPU offloading
- Set BUILD_TYPE and related environment variables
To jump to the end, this is a working helm release for LocalAI which contains everything except step 1: https://github.com/lenaxia/home-ops-prod/blob/5039ba39489347e2753e7a333d53664dc3f8daf7/cluster/apps/home/localai/app/helm-release.yaml
Step 1: Enable GPU passthrough (Intel iGPU)
This is done through three helm releases which together automatically identify what features are available on a given node and label each node accordingly. In the case of Intel iGPUs, this also makes the GPU schedulable as a pod resource (see the sketch after this list). If you have another way of tagging your nodes with GPU resources, that should work too.
- Node Feature Discovery https://github.com/lenaxia/home-ops-prod/blob/071b3f83dcd5934d614a42e5df529a6674737703/cluster/apps/kube-system/node-feature-discovery/app/helm-release.yaml
- Intel Device Plugin Operator https://github.com/lenaxia/home-ops-prod/blob/d567c5101954c934f0dd8a94c73ec209555ad15a/cluster/apps/kube-system/intel-device-plugin/app/helm-release.yaml
- Intel Device Plugin GPU https://github.com/lenaxia/home-ops-prod/blob/2162ef0b7b4df3d02f86ad50c997c9291a5dd478/cluster/apps/kube-system/intel-device-plugin/gpu/helm-release.yaml
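Once these three releases are deployed, GPU-capable nodes get labeled and the iGPU becomes schedulable as a resource. A minimal sketch of the corresponding resource request on the LocalAI container, assuming the gpu.intel.com/i915 resource name exposed by the Intel GPU device plugin (the limit of 1 is an example value):

resources:
  limits:
    gpu.intel.com/i915: 1  # resource exposed by the Intel GPU device plugin; adjust to your node

Without such a request, the scheduler will not place the pod on a GPU node and the device will not be mounted into the container.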
Step 2: Install OpenCL drivers and CLBlast
Reference the latest Intel OpenCL driver and installation instructions here: https://github.com/intel/compute-runtime/releases
To have your Helm release install these drivers automatically, you can use the pod lifecycle postStart hook:
lifecycle:
  postStart:
    exec:
      command:
        - /bin/bash
        - -c
        - >
          apt-get update &&
          apt-get install -y libclblast-dev &&
          mkdir -p /tmp/neo &&
          cd /tmp/neo &&
          wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.13700.14/intel-igc-core_1.0.13700.14_amd64.deb &&
          wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.13700.14/intel-igc-opencl_1.0.13700.14_amd64.deb &&
          wget https://github.com/intel/compute-runtime/releases/download/23.13.26032.30/intel-level-zero-gpu-dbgsym_1.3.26032.30_amd64.ddeb &&
          wget https://github.com/intel/compute-runtime/releases/download/23.13.26032.30/intel-level-zero-gpu_1.3.26032.30_amd64.deb &&
          wget https://github.com/intel/compute-runtime/releases/download/23.13.26032.30/intel-opencl-icd-dbgsym_23.13.26032.30_amd64.ddeb &&
          wget https://github.com/intel/compute-runtime/releases/download/23.13.26032.30/intel-opencl-icd_23.13.26032.30_amd64.deb &&
          wget https://github.com/intel/compute-runtime/releases/download/23.13.26032.30/libigdgmm12_22.3.0_amd64.deb &&
          dpkg -i *.deb
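To sanity-check the driver installation, you can exec into the running pod and list the available OpenCL platforms. This is a sketch: clinfo is not part of the image and is installed ad hoc here, and the pod name is a placeholder:

kubectl exec -it <localai-pod> -- /bin/bash -c "apt-get update && apt-get install -y clinfo && clinfo -l"

If the Intel OpenCL ICD is set up correctly, the output should list a platform such as 'Intel(R) OpenCL HD Graphics'.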
Step 3: Configure GPU offloading
As defined in config.go (https://github.com/go-skynet/LocalAI/blob/cdf0a6e7667e1fb3412951f078aaf017a6fd6437/api/config.go#L35), each model configuration should contain a gpu_layers setting that defines how many layers are offloaded to the GPU.
In the case of Vicuna, the Model Gallery yaml can be found here: https://raw.githubusercontent.com/go-skynet/model-gallery/main/vicuna.yaml. Under config_file, add a gpu_layers entry, as in the following:
name: "vicuna"
description: |
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality
license: "LLaMA"
urls:
- https://github.com/lm-sys/FastChat
config_file: |
backend: llama
parameters:
model: vicuna
top_k: 80
temperature: 0.2
top_p: 0.7
context_size: 1024
template:
completion: vicuna-completion
chat: vicuna-chat
gpu_layers: 32
prompt_templates:
- name: "vicuna-completion"
content: |
{{.Input}}
- name: "vicuna-chat"
content: |
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{{.Input}}
### Response:
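Once your modified gallery yaml is hosted somewhere reachable, the model can be installed through LocalAI's /models/apply endpoint. A sketch, assuming the service is reachable at localhost:8080 and that the url points at your edited copy (the URL below is a placeholder):

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"url": "https://example.com/my-vicuna.yaml"}'

The upstream file can be applied the same way, but it does not contain the gpu_layers entry.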
Step 4: Environment Variables
Set both BUILD_TYPE and LLAMA_CLBLAST to ensure that LocalAI is built with CLBlast support when the pod starts:
env:
  - name: BUILD_TYPE
    value: "clblas"
  - name: LLAMA_CLBLAST
    value: "1"
Run a query
To verify that the GPU is being used, send a request to the API and then check your pod logs.
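A sketch of such a request, assuming the service is port-forwarded to localhost:8080 and the vicuna model from step 3:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "vicuna",
  "messages": [{"role": "user", "content": "How are you?"}]
}'

In the pod logs (kubectl logs <localai-pod>) you should then see something along the following lines: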
ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) HD Graphics 530'
ggml_opencl: device FP16 support: true
llama_model_load_internal: mem required = 10583.26 MB (+ 3216.00 MB per state)
ggml_opencl: offloading 32 layers to GPU
ggml_opencl: total VRAM used: 6655 MB