
Documentation: CLBlast support in Kubernetes, to enable AMD and Intel iGPU


Spawned off of #404

This is a runbook for enabling CLBlast in Kubernetes; it can also be applied to Docker with a bit of work. This enables AMD GPUs and Intel iGPUs.

The main steps that need to be done are:

  1. Enable GPU Passthrough
  2. Install OpenCL drivers and CLBlast
  3. Configure GPU offloading
  4. Set BUILD_TYPE and other environment variables

To jump to the end, this is a working helm release for LocalAI which contains everything except step 1: https://github.com/lenaxia/home-ops-prod/blob/5039ba39489347e2753e7a333d53664dc3f8daf7/cluster/apps/home/localai/app/helm-release.yaml

Step 1: Enable GPU passthrough (Intel iGPU)

This is done through three helm releases which together automatically identify what features are available on a given node and label each node accordingly. In the case of Intel iGPUs, this also enables GPU resource requests (a sketch of the resulting resource request follows the list below). If you have another way of tagging your nodes with GPU resources, that should work too.

  • Node Feature Discovery https://github.com/lenaxia/home-ops-prod/blob/071b3f83dcd5934d614a42e5df529a6674737703/cluster/apps/kube-system/node-feature-discovery/app/helm-release.yaml
  • Intel Device Plugin Operator https://github.com/lenaxia/home-ops-prod/blob/d567c5101954c934f0dd8a94c73ec209555ad15a/cluster/apps/kube-system/intel-device-plugin/app/helm-release.yaml
  • Intel Device Plugin GPU https://github.com/lenaxia/home-ops-prod/blob/2162ef0b7b4df3d02f86ad50c997c9291a5dd478/cluster/apps/kube-system/intel-device-plugin/gpu/helm-release.yaml
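
Once the device plugin is running, the LocalAI container can request the iGPU like any other resource. This is a minimal sketch of the resources block for the LocalAI deployment; it assumes the Intel GPU device plugin's default gpu.intel.com/i915 resource name, so adjust it if your setup differs:

    # Container spec of the LocalAI deployment (helm values) -- sketch only
    resources:
      limits:
        gpu.intel.com/i915: 1   # request one Intel iGPU from the device plugin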

Step 2: Install OpenCL drivers and CLBlast

Reference the latest Intel OpenCL driver and installation instructions here: https://github.com/intel/compute-runtime/releases

In order to get your helm release to automatically install these drivers, you can utilize the pod lifecycle postStart option:

    lifecycle:
      postStart:
        exec:
          command:
            - /bin/bash
            - -c
            - >
              apt-get update && apt-get install -y libclblast-dev &&
              mkdir /tmp/neo &&
              cd /tmp/neo &&
              wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.13700.14/intel-igc-core_1.0.13700.14_amd64.deb &&
              wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.13700.14/intel-igc-opencl_1.0.13700.14_amd64.deb &&
              wget https://github.com/intel/compute-runtime/releases/download/23.13.26032.30/intel-level-zero-gpu-dbgsym_1.3.26032.30_amd64.ddeb &&
              wget https://github.com/intel/compute-runtime/releases/download/23.13.26032.30/intel-level-zero-gpu_1.3.26032.30_amd64.deb &&
              wget https://github.com/intel/compute-runtime/releases/download/23.13.26032.30/intel-opencl-icd-dbgsym_23.13.26032.30_amd64.ddeb &&
              wget https://github.com/intel/compute-runtime/releases/download/23.13.26032.30/intel-opencl-icd_23.13.26032.30_amd64.deb &&
              wget https://github.com/intel/compute-runtime/releases/download/23.13.26032.30/libigdgmm12_22.3.0_amd64.deb &&
              dpkg -i *.deb
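
To sanity-check that the drivers landed, one option (not part of the original runbook, just a suggestion) is to look for an OpenCL ICD entry inside the running pod; the pod name below is a placeholder:

    # Installed OpenCL ICD loaders are registered under /etc/OpenCL/vendors
    kubectl exec -it <localai-pod> -- ls /etc/OpenCL/vendors/
    # Expect an Intel entry (typically intel.icd) if intel-opencl-icd installed correctly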

Step 3: Configure GPU offloading

As defined here: https://github.com/go-skynet/LocalAI/blob/cdf0a6e7667e1fb3412951f078aaf017a6fd6437/api/config.go#L35, each model config should contain a gpu_layers setting that defines how many layers should be offloaded to the GPU.

In the case of Vicuna, the model gallery YAML can be found here: https://raw.githubusercontent.com/go-skynet/model-gallery/main/vicuna.yaml. Under config_file, add a gpu_layers entry, as in the example below:

name: "vicuna"

description: |
    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality

license: "LLaMA"
urls:
- https://github.com/lm-sys/FastChat

config_file: |
    backend: llama
    parameters:
      model: vicuna
      top_k: 80
      temperature: 0.2
      top_p: 0.7
    context_size: 1024
    template:
      completion: vicuna-completion
      chat: vicuna-chat
    gpu_layers: 32

prompt_templates:
- name: "vicuna-completion"
  content: |
      {{.Input}}

- name: "vicuna-chat"
  content: |
    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    {{.Input}}

    ### Response:
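
If you are not using the model gallery, the same setting can go directly into a model YAML config file in your models directory. Here is a minimal sketch; the file name and model file name are illustrative:

    # models/vicuna.yaml -- standalone config (file and model names are examples)
    name: vicuna
    backend: llama
    parameters:
      model: ggml-vicuna-13b-q4_0.bin   # model file placed in the models directory
    context_size: 1024
    gpu_layers: 32                      # number of layers to offload to the GPU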

Step 4: Set environment variables

Set both BUILD_TYPE and LLAMA_CLBLAST to ensure that LocalAI is built with CLBlast support:

    env:
    - name: BUILD_TYPE
      value: clblas
    - name: LLAMA_CLBLAST
      value: "1"

Run a query
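
First send a request to the LocalAI API; the service address and model name below are examples, so adjust them to your deployment:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "vicuna", "messages": [{"role": "user", "content": "How are you?"}]}'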

Then check your pod logs to verify that the GPU is being used; you should see something along the following lines:

    ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
    ggml_opencl: selecting device: 'Intel(R) HD Graphics 530'
    ggml_opencl: device FP16 support: true
    llama_model_load_internal: mem required  = 10583.26 MB (+ 3216.00 MB per state)
    ggml_opencl: offloading 32 layers to GPU
    ggml_opencl: total VRAM used: 6655 MB
