LocalAI
feat: automatically adjust default gpu_layers by available GPU memory
Is your feature request related to a problem? Please describe.
Having a high default number of GPU layers doesn't always work. For instance, big models can exceed the card's VRAM and force the user to configure gpu_layers manually.
Describe the solution you'd like
With libraries like https://github.com/gpustack/gguf-parser-go we could identify beforehand how much GPU VRAM would be used and adjust the default settings accordingly.
Describe alternatives you've considered
Keep things as is.
Additional context
@mudler happy to take this task and work on it. I have to think a bit about the approach and look around for alternatives.
@mudler rough design/thoughts on adding this feature. ChatGPT-generated markdown for the proposed solution:
Design Document: Optimizing GPU Layer Configuration in LocalAI Using gguf-parser
Overview
Rough solution to optimize GPU layer configuration when using LocalAI for running large models, such as Qwen2-72B-Instruct-GGUF. The optimization leverages the gguf-parser library to dynamically adjust GPU memory usage based on the model's requirements and the available hardware resources.
Problem Statement
Large models like Qwen2-72B-Instruct-GGUF can easily exceed the VRAM capacity of a single GPU, requiring manual tuning of GPU layers to fit the model within the available memory. Overcommitting the GPU with layers can lead to reduced performance or out-of-memory failures, especially on systems with limited GPU memory.
Solution Approach
Dynamically adjust the GPU layer configuration based on the model metadata provided by gguf-parser. This approach will allow us to:
- Estimate VRAM usage and distribute model layers across multiple GPUs.
- Offload layers between system memory and GPU memory if necessary.
- Ensure optimal performance without manual intervention.
Key Features
- VRAM Estimation: Use `gguf-parser` to estimate GPU memory requirements (a sketch of the resulting gpu_layers heuristic follows this list).
- Dynamic Layer Distribution: Use the `--tensor-split` and `--rpc` flags to distribute layers across multiple GPUs and servers.
- Batch Size Adjustment: Adjust the batch size to fit within available memory while maintaining performance.
- Flash Attention Tuning: Enable/disable flash attention based on hardware capabilities to optimize performance.
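As a rough illustration of the first two features, the default gpu_layers could simply be scaled by the fraction of the full-offload VRAM estimate that actually fits on the card. A minimal sketch in Go; the function name and the 24 GiB card in the example are assumptions for illustration, not measured values or existing LocalAI code:

```go
package main

import "fmt"

// defaultGPULayers scales the offloadable layer count by the fraction of the
// model's full-offload VRAM that fits into the free VRAM (both in bytes).
func defaultGPULayers(totalLayers int, modelVRAM, freeVRAM uint64) int {
	if totalLayers <= 0 || modelVRAM == 0 {
		return 0
	}
	if freeVRAM >= modelVRAM {
		return totalLayers // the whole model fits on the GPU
	}
	return int(float64(totalLayers) * float64(freeVRAM) / float64(modelVRAM))
}

func main() {
	// Qwen2-72B numbers from the workflow below: 80 layers, ~73.47 GiB VRAM.
	// On an assumed 24 GiB card this yields 26 layers.
	gib := float64(1 << 30)
	modelVRAM := uint64(73.47 * gib)
	fmt.Println(defaultGPULayers(80, modelVRAM, 24<<30))
}
```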
Components
- LocalAI Instance:
  - Runs the model using the optimized GPU configuration.
  - Distributes layers across multiple GPUs based on VRAM estimation.
- gguf-parser Integration:
  - Parses the model metadata to provide the following details:
    - VRAM requirement per GPU
    - Layer distribution for both local and remote GPUs
    - Batch size and context length
    - Offloading support (RAM usage for system memory)
- Layer Distribution and Offloading Logic:
  - Adjusts the number of GPU layers dynamically based on the VRAM and RPC flags.
  - If the VRAM requirement exceeds the GPU's capacity, offloads the excess to system memory or distributes it across multiple GPUs (see the sketch after this list).
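A hedged sketch of that layer distribution and offloading decision, assuming the estimate (required VRAM, offloadable layer count) has already been obtained from gguf-parser; the type and function names are illustrative, not existing LocalAI code:

```go
package main

import "fmt"

// placement describes how the model should be laid out. Illustrative only.
type placement struct {
	GPULayers  int  // layers to offload to GPU(s)
	SplitGPUs  bool // whether a --tensor-split across several GPUs is needed
	CPUOffload bool // whether part of the model stays in system RAM
}

// planPlacement decides between full offload on one GPU, splitting across
// several GPUs, and partial offload with the remainder in system memory.
func planPlacement(modelVRAM uint64, totalLayers int, gpuFree []uint64) placement {
	var totalFree uint64
	for _, f := range gpuFree {
		totalFree += f
	}
	switch {
	case len(gpuFree) > 0 && gpuFree[0] >= modelVRAM:
		return placement{GPULayers: totalLayers} // fits on the first GPU
	case len(gpuFree) > 1 && totalFree >= modelVRAM:
		return placement{GPULayers: totalLayers, SplitGPUs: true}
	default:
		// Offload only the fraction that fits; the rest is served from RAM.
		layers := 0
		if modelVRAM > 0 {
			layers = int(float64(totalLayers) * float64(totalFree) / float64(modelVRAM))
		}
		return placement{GPULayers: layers, CPUOffload: true}
	}
}

func main() {
	// Qwen2-72B (~73.47 GiB, 80 layers) on two assumed 48 GiB GPUs.
	fmt.Printf("%+v\n", planPlacement(78_880_000_000, 80, []uint64{48 << 30, 48 << 30}))
}
```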
Reference
- https://github.com/ggerganov/llama.cpp/blob/a39ab216aa624308fda7fa84439c6b61dc98b87a/examples/main/README.md#L318
- https://github.com/gpustack/gguf-parser-go
Workflow
- Model Parsing with gguf-parser:
  - Retrieve model metadata using `gguf-parser`:
    `gguf-parser --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"`
  - Key metrics:
    - Model size: 72.71 B parameters (~59.92 GiB)
    - VRAM requirement: 73.47 GiB
    - Transformer layers: 80 layers
    - Supported flags: `--tensor-split`, `--rpc`
    - Offloading capability: unsupported for distributed inference
- VRAM and Layer Adjustment:
  - Compare the model's VRAM requirement with the available VRAM on the system (see the free-VRAM probe sketch after this list).
  - If the model exceeds the VRAM limit, adjust the number of layers or distribute them across multiple GPUs using `--tensor-split`.
  - Example command to split the model across two GPUs:
    `local-ai --tensor-split="0:50,1:30" --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"`
- Batch Size and Context Length Adjustment:
  - The recommended batch sizes for this model are 2048 (logical) / 512 (physical) tokens.
  - Dynamically adjust the batch size based on available memory to prevent memory overrun:
    `local-ai --batch-size=512 --ctx-size=32768 --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"`
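To compare the estimate with what is actually available (the second step above), we need the free VRAM per GPU. One possible probe on NVIDIA systems is `nvidia-smi`; a sketch only, other vendors or NVML bindings would need their own path, and this is not how LocalAI currently detects GPUs:

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// freeVRAMPerGPU queries nvidia-smi for the free memory of each visible GPU,
// returning bytes per device.
func freeVRAMPerGPU() ([]uint64, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=memory.free", "--format=csv,noheader,nounits").Output()
	if err != nil {
		return nil, err
	}
	var free []uint64
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		mib, err := strconv.ParseUint(strings.TrimSpace(line), 10, 64)
		if err != nil {
			return nil, err
		}
		free = append(free, mib*1024*1024) // nvidia-smi reports MiB
	}
	return free, nil
}

func main() {
	free, err := freeVRAMPerGPU()
	if err != nil {
		fmt.Println("no NVIDIA GPU detected:", err)
		return
	}
	fmt.Println("free VRAM per GPU (bytes):", free)
}
```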
Estimation Process
VRAM Estimation (Per GPU)
Using gguf-parser, the following memory requirements were extracted:
- VRAM for one GPU: 73.47 GiB (full model on a single GPU)
- RAM Offload: 441.38 MiB can be used for offloading parts of the model to system memory.
Tensor Split for Multi-GPU Setup
The model can be distributed across multiple GPUs using the --tensor-split flag:
- Example: 50% of the model layers on GPU 0 and 30% on GPU 1.
`local-ai --tensor-split="0:50,1:30" --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"`
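Given the per-GPU free VRAM, the split proportions could be derived directly from it. A small sketch; note that llama.cpp's `--tensor-split` flag itself takes a plain comma-separated list of proportions (e.g. `3,1`), so the `0:50,1:30` device:percent form used above may need translating to that format:

```go
package main

import (
	"fmt"
	"strings"
)

// tensorSplit turns per-GPU free VRAM (bytes) into a comma-separated list of
// proportions, e.g. "48,24", suitable for a --tensor-split style flag.
func tensorSplit(gpuFree []uint64) string {
	parts := make([]string, len(gpuFree))
	for i, f := range gpuFree {
		parts[i] = fmt.Sprintf("%d", f>>30) // proportion expressed in GiB
	}
	return strings.Join(parts, ",")
}

func main() {
	// Two assumed GPUs with 48 GiB and 24 GiB free.
	fmt.Println(tensorSplit([]uint64{48 << 30, 24 << 30})) // "48,24"
}
```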
@mudler @sozercan Some more context now that I have a working prototype for parsing GGUF models.
Using gguf-parser, I have the following output for the model Meta Llama Meta Llama 3.1 405B Instruct.
RAM and VRAM estimates:
{ "estimate": { "items": [ { "offloadLayers": 127, "fullOffloaded": true, "ram": { "uma": 364586936, "nonuma": 521873336 }, "vram": [ { "uma": 16106705920, "nonuma": 34165613568 } ] } ], "type": "model", "architecture": "llama", "contextSize": 131072, "flashAttention": false, "noMMap": false, "embeddingOnly": false, "distributable": true, "logicalBatchSize": 2048, "physicalBatchSize": 512 }, "architecture": { "type": "model", "architecture": "llama", "maximumContextLength": 131072, "embeddingLength": 16384, "blockCount": 126, "feedForwardLength": 53248, "attentionHeadCount": 128, "attentionHeadCountKV": 16, "attentionLayerNormRMSEpsilon": 0.00001, "attentionKeyLength": 128, "attentionValueLength": 128, "attentionCausal": true, "ropeDimensionCount": 128, "ropeFrequencyBase": 500000, "vocabularyLength": 128256, "embeddingGQA": 8, "embeddingKeyGQA": 2048, "embeddingValueGQA": 2048 }, "metadata": { "type": "model", "architecture": "llama", "quantizationVersion": 2, "alignment": 32, "name": "Models Meta Llama Meta Llama 3.1 405B Instruct", "license": "llama3.1", "fileType": 10, "littleEndian": true, "fileSize": 17239928096, "size": 17232101376, "parameters": 47232516096, "bitsPerWeight": 2.9186844657567317 }, "tokenizer": { "model": "gpt2", "tokensLength": 128256, "mergesLength": 280147, "addedTokenLength": 0, "bosTokenID": 128000, "eosTokenID": -1, "eotTokenID": -1, "eomTokenID": -1, "unknownTokenID": -1, "separatorTokenID": -1, "paddingTokenID": -1, "tokensSize": 2099452, "mergesSize": 5204765 }
Based on the above values, rough math for a machine with 10 GB of VRAM puts gpu_layers at 37 layers, which LocalAI can then set as a parameter to pass down to llama.cpp.
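To sanity-check that math, here is a small sketch that decodes just the relevant fields from the estimate JSON above and reproduces the 37-layer figure for a 10 GB card; the struct names are mine, not gguf-parser-go types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Only the fields needed for the layer calculation are decoded here.
type estimateOutput struct {
	Estimate struct {
		Items []struct {
			OffloadLayers int `json:"offloadLayers"`
			VRAM          []struct {
				NonUMA uint64 `json:"nonuma"`
			} `json:"vram"`
		} `json:"items"`
	} `json:"estimate"`
}

func main() {
	// Trimmed-down version of the gguf-parser output shown above.
	raw := []byte(`{"estimate":{"items":[{"offloadLayers":127,
		"vram":[{"uma":16106705920,"nonuma":34165613568}]}]}}`)

	var out estimateOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		log.Fatal(err)
	}
	item := out.Estimate.Items[0]

	// Rough math from the comment above: scale the offloadable layer count
	// by the fraction of the required VRAM that a 10 GB card provides.
	const freeVRAM = 10_000_000_000 // 10 GB
	layers := int(float64(item.OffloadLayers) * freeVRAM / float64(item.VRAM[0].NonUMA))
	fmt.Println("gpu_layers:", layers) // prints 37
}
```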
that sounds like a good direction - would be cool now to use the library from the code and set the GPU layers in the model defaults accordingly https://github.com/mudler/LocalAI/blob/04c0841ca9e085dfd835b16684a8b82e57232068/core/config/backend_config.go#L291
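Purely as a sketch of where this could hook in, assuming a gguf-parser-backed estimator exists: none of the names below are real LocalAI or gguf-parser-go identifiers, the file name is a placeholder, and the actual change would live in the SetDefaults path linked above rather than in a standalone file.

```go
package main

import "fmt"

// backendConfig stands in for LocalAI's backend config; only the field that
// matters here is sketched. NGPULayers mirrors gpu_layers in the model YAML.
type backendConfig struct {
	NGPULayers *int
}

// estimateGPULayers is a stand-in for the gguf-parser-backed estimate
// discussed in this thread: offloadable layers scaled by free/required VRAM.
// The hard-coded numbers are the ones from the example output above.
func estimateGPULayers(modelPath string, freeVRAM uint64) (int, error) {
	_ = modelPath // a real implementation would parse the GGUF file here
	const offloadLayers, requiredVRAM = 127, uint64(34165613568)
	return int(float64(offloadLayers) * float64(freeVRAM) / float64(requiredVRAM)), nil
}

// setDefaultGPULayers fills gpu_layers only when the user has not set it.
func setDefaultGPULayers(cfg *backendConfig, modelPath string, freeVRAM uint64) {
	if cfg.NGPULayers != nil {
		return // explicit user configuration wins
	}
	if layers, err := estimateGPULayers(modelPath, freeVRAM); err == nil {
		cfg.NGPULayers = &layers
	}
}

func main() {
	cfg := &backendConfig{}
	setDefaultGPULayers(cfg, "llama-3.1-405b-instruct.gguf", 10_000_000_000)
	fmt.Println("defaulted gpu_layers:", *cfg.NGPULayers) // 37 with 10 GB free
}
```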
