LocalAI
feat: automatically adjust default gpu_layers by available GPU memory
Is your feature request related to a problem? Please describe.
Having a high default number of GPU layers doesn't always work. For instance, big models can exceed the card's VRAM and force the user to configure gpu_layers manually.
Describe the solution you'd like
With libraries like https://github.com/gpustack/gguf-parser-go we could identify beforehand how much GPU VRAM would be used and adjust the default settings accordingly.
Describe alternatives you've considered
Keep things as is.
Additional context
@mudler happy to take this task and work on it. I have to think a bit about the approach and look around for alternatives.
@mudler rough design/thoughts on adding this feature. ChatGPT-generated markdown for the proposed solution:
Design Document: Optimizing GPU Layer Configuration in LocalAI Using gguf-parser
Overview
Rough solution to optimize GPU layer configuration when using LocalAI for running large models, such as Qwen2-72B-Instruct-GGUF. The optimization leverages the gguf-parser library to dynamically adjust GPU memory usage based on the model's requirements and the available hardware resources.
Problem Statement
Large models like Qwen2-72B-Instruct-GGUF can easily exceed the VRAM capacity of a single GPU, requiring manual tuning of GPU layers to fit the model within the available memory. Overcommitting the GPU with layers can lead to reduced performance or out-of-memory failures, especially on systems with limited GPU memory.
Solution Approach
Dynamically adjust the GPU layer configuration based on the model metadata provided by gguf-parser. This approach will allow us to:
- Estimate VRAM usage and distribute model layers across multiple GPUs.
- Offload layers between system memory and GPU memory if necessary.
- Ensure optimal performance without manual intervention.
Key Features
- VRAM Estimation: Use `gguf-parser` to estimate GPU memory requirements (a sketch of the resulting gpu_layers heuristic follows this list).
- Dynamic Layer Distribution: Use the `--tensor-split` and `--rpc` flags to distribute layers across multiple GPUs and servers.
- Batch Size Adjustment: Adjust the batch size to fit within available memory while maintaining performance.
- Flash Attention Tuning: Enable/disable flash attention based on hardware capabilities to optimize performance.
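As a rough illustration of the first two features, the default gpu_layers could simply be scaled by the fraction of the full-offload VRAM estimate that actually fits on the card. A minimal sketch in Go; the function name and the 24 GiB card in the example are assumptions for illustration, not measured values or existing LocalAI code:

```go
package main

import "fmt"

// defaultGPULayers scales the offloadable layer count by the fraction of the
// model's full-offload VRAM that fits into the free VRAM (both in bytes).
func defaultGPULayers(totalLayers int, modelVRAM, freeVRAM uint64) int {
	if totalLayers <= 0 || modelVRAM == 0 {
		return 0
	}
	if freeVRAM >= modelVRAM {
		return totalLayers // the whole model fits on the GPU
	}
	return int(float64(totalLayers) * float64(freeVRAM) / float64(modelVRAM))
}

func main() {
	// Qwen2-72B numbers from the workflow below: 80 layers, ~73.47 GiB VRAM.
	// On an assumed 24 GiB card this yields 26 layers.
	gib := float64(1 << 30)
	modelVRAM := uint64(73.47 * gib)
	fmt.Println(defaultGPULayers(80, modelVRAM, 24<<30))
}
```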
Components
- LocalAI Instance:
  - Runs the model using the optimized GPU configuration.
  - Distributes layers across multiple GPUs based on VRAM estimation.
- gguf-parser Integration:
  - Parses the model metadata to provide the following details:
    - VRAM requirement per GPU
    - Layer distribution for both local and remote GPUs
    - Batch size and context length
    - Offloading support (RAM usage for system memory)
- Layer Distribution and Offloading Logic:
  - Adjusts the number of GPU layers dynamically based on the VRAM and RPC flags.
  - If the VRAM requirement exceeds the GPU's capacity, offloads the excess to system memory or distributes it across multiple GPUs (see the sketch after this list).
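A hedged sketch of that layer distribution and offloading decision, assuming the estimate (required VRAM, offloadable layer count) has already been obtained from gguf-parser; the type and function names are illustrative, not existing LocalAI code:

```go
package main

import "fmt"

// placement describes how the model should be laid out. Illustrative only.
type placement struct {
	GPULayers  int  // layers to offload to GPU(s)
	SplitGPUs  bool // whether a --tensor-split across several GPUs is needed
	CPUOffload bool // whether part of the model stays in system RAM
}

// planPlacement decides between full offload on one GPU, splitting across
// several GPUs, and partial offload with the remainder in system memory.
func planPlacement(modelVRAM uint64, totalLayers int, gpuFree []uint64) placement {
	var totalFree uint64
	for _, f := range gpuFree {
		totalFree += f
	}
	switch {
	case len(gpuFree) > 0 && gpuFree[0] >= modelVRAM:
		return placement{GPULayers: totalLayers} // fits on the first GPU
	case len(gpuFree) > 1 && totalFree >= modelVRAM:
		return placement{GPULayers: totalLayers, SplitGPUs: true}
	default:
		// Offload only the fraction that fits; the rest is served from RAM.
		layers := 0
		if modelVRAM > 0 {
			layers = int(float64(totalLayers) * float64(totalFree) / float64(modelVRAM))
		}
		return placement{GPULayers: layers, CPUOffload: true}
	}
}

func main() {
	// Qwen2-72B (~73.47 GiB, 80 layers) on two assumed 48 GiB GPUs.
	fmt.Printf("%+v\n", planPlacement(78_880_000_000, 80, []uint64{48 << 30, 48 << 30}))
}
```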
Reference
- https://github.com/ggerganov/llama.cpp/blob/a39ab216aa624308fda7fa84439c6b61dc98b87a/examples/main/README.md#L318
- https://github.com/gpustack/gguf-parser-go
Workflow
- Model Parsing with gguf-parser:
  - Retrieve model metadata using `gguf-parser`:
    `gguf-parser --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"`
  - Key metrics:
    - Model size: 72.71 B parameters (~59.92 GiB)
    - VRAM requirement: 73.47 GiB
    - Transformer layers: 80 layers
    - Supported flags: `--tensor-split`, `--rpc`
    - Offloading capability: unsupported for distributed inference
- VRAM and Layer Adjustment:
  - Compare the model's VRAM requirement with the available VRAM on the system (see the free-VRAM probe sketch after this list).
  - If the model exceeds the VRAM limit, adjust the number of layers or distribute them across multiple GPUs using `--tensor-split`.
  - Example command to split the model across two GPUs:
    `local-ai --tensor-split="0:50,1:30" --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"`
- Batch Size and Context Length Adjustment:
  - The recommended batch sizes for this model are 2048 (logical) / 512 (physical) tokens.
  - Dynamically adjust the batch size based on available memory to prevent memory overrun:
    `local-ai --batch-size=512 --ctx-size=32768 --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"`
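To compare the estimate with what is actually available (the second step above), we need the free VRAM per GPU. One possible probe on NVIDIA systems is `nvidia-smi`; a sketch only, other vendors or NVML bindings would need their own path, and this is not how LocalAI currently detects GPUs:

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// freeVRAMPerGPU queries nvidia-smi for the free memory of each visible GPU,
// returning bytes per device.
func freeVRAMPerGPU() ([]uint64, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=memory.free", "--format=csv,noheader,nounits").Output()
	if err != nil {
		return nil, err
	}
	var free []uint64
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		mib, err := strconv.ParseUint(strings.TrimSpace(line), 10, 64)
		if err != nil {
			return nil, err
		}
		free = append(free, mib*1024*1024) // nvidia-smi reports MiB
	}
	return free, nil
}

func main() {
	free, err := freeVRAMPerGPU()
	if err != nil {
		fmt.Println("no NVIDIA GPU detected:", err)
		return
	}
	fmt.Println("free VRAM per GPU (bytes):", free)
}
```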
Estimation Process
VRAM Estimation (Per GPU)
Using gguf-parser, the following memory requirements were extracted:
- VRAM for one GPU: 73.47 GiB (full model on a single GPU)
- RAM Offload: 441.38 MiB can be used for offloading parts of the model to system memory.
Tensor Split for Multi-GPU Setup
The model can be distributed across multiple GPUs using the --tensor-split flag:
- Example: 50% of the model layers on GPU 0 and 30% on GPU 1.
`local-ai --tensor-split="0:50,1:30" --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"`
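Given the per-GPU free VRAM, the split proportions could be derived directly from it. A small sketch; note that llama.cpp's `--tensor-split` flag itself takes a plain comma-separated list of proportions (e.g. `3,1`), so the `0:50,1:30` device:percent form used above may need translating to that format:

```go
package main

import (
	"fmt"
	"strings"
)

// tensorSplit turns per-GPU free VRAM (bytes) into a comma-separated list of
// proportions, e.g. "48,24", suitable for a --tensor-split style flag.
func tensorSplit(gpuFree []uint64) string {
	parts := make([]string, len(gpuFree))
	for i, f := range gpuFree {
		parts[i] = fmt.Sprintf("%d", f>>30) // proportion expressed in GiB
	}
	return strings.Join(parts, ",")
}

func main() {
	// Two assumed GPUs with 48 GiB and 24 GiB free.
	fmt.Println(tensorSplit([]uint64{48 << 30, 24 << 30})) // "48,24"
}
```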
@mudler @sozercan Some more context now that I have a working prototype for parsing GGUF models.
Using gguf-parser, I have the following output for the model Meta Llama Meta Llama 3.1 405B Instruct.
RAM and VRAM estimates:
{ "estimate": { "items": [ { "offloadLayers": 127, "fullOffloaded": true, "ram": { "uma": 364586936, "nonuma": 521873336 }, "vram": [ { "uma": 16106705920, "nonuma": 34165613568 } ] } ], "type": "model", "architecture": "llama", "contextSize": 131072, "flashAttention": false, "noMMap": false, "embeddingOnly": false, "distributable": true, "logicalBatchSize": 2048, "physicalBatchSize": 512 }, "architecture": { "type": "model", "architecture": "llama", "maximumContextLength": 131072, "embeddingLength": 16384, "blockCount": 126, "feedForwardLength": 53248, "attentionHeadCount": 128, "attentionHeadCountKV": 16, "attentionLayerNormRMSEpsilon": 0.00001, "attentionKeyLength": 128, "attentionValueLength": 128, "attentionCausal": true, "ropeDimensionCount": 128, "ropeFrequencyBase": 500000, "vocabularyLength": 128256, "embeddingGQA": 8, "embeddingKeyGQA": 2048, "embeddingValueGQA": 2048 }, "metadata": { "type": "model", "architecture": "llama", "quantizationVersion": 2, "alignment": 32, "name": "Models Meta Llama Meta Llama 3.1 405B Instruct", "license": "llama3.1", "fileType": 10, "littleEndian": true, "fileSize": 17239928096, "size": 17232101376, "parameters": 47232516096, "bitsPerWeight": 2.9186844657567317 }, "tokenizer": { "model": "gpt2", "tokensLength": 128256, "mergesLength": 280147, "addedTokenLength": 0, "bosTokenID": 128000, "eosTokenID": -1, "eotTokenID": -1, "eomTokenID": -1, "unknownTokenID": -1, "separatorTokenID": -1, "paddingTokenID": -1, "tokensSize": 2099452, "mergesSize": 5204765 }
Based on the above values, rough math for a machine with 10 GB of VRAM puts gpu_layers at 37 layers, which LocalAI can then set as a parameter to pass down to llama.cpp.
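To sanity-check that math, here is a small sketch that decodes just the relevant fields from the estimate JSON above and reproduces the 37-layer figure for a 10 GB card; the struct names are mine, not gguf-parser-go types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Only the fields needed for the layer calculation are decoded here.
type estimateOutput struct {
	Estimate struct {
		Items []struct {
			OffloadLayers int `json:"offloadLayers"`
			VRAM          []struct {
				NonUMA uint64 `json:"nonuma"`
			} `json:"vram"`
		} `json:"items"`
	} `json:"estimate"`
}

func main() {
	// Trimmed-down version of the gguf-parser output shown above.
	raw := []byte(`{"estimate":{"items":[{"offloadLayers":127,
		"vram":[{"uma":16106705920,"nonuma":34165613568}]}]}}`)

	var out estimateOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		log.Fatal(err)
	}
	item := out.Estimate.Items[0]

	// Rough math from the comment above: scale the offloadable layer count
	// by the fraction of the required VRAM that a 10 GB card provides.
	const freeVRAM = 10_000_000_000 // 10 GB
	layers := int(float64(item.OffloadLayers) * freeVRAM / float64(item.VRAM[0].NonUMA))
	fmt.Println("gpu_layers:", layers) // prints 37
}
```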
that sounds like a good direction - would be cool now to use the library from the code and set the GPU layers in the model defaults accordingly https://github.com/mudler/LocalAI/blob/04c0841ca9e085dfd835b16684a8b82e57232068/core/config/backend_config.go#L291
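Purely as a sketch of where this could hook in, assuming a gguf-parser-backed estimator exists: none of the names below are real LocalAI or gguf-parser-go identifiers, the file name is a placeholder, and the actual change would live in the SetDefaults path linked above rather than in a standalone file.

```go
package main

import "fmt"

// backendConfig stands in for LocalAI's backend config; only the field that
// matters here is sketched. NGPULayers mirrors gpu_layers in the model YAML.
type backendConfig struct {
	NGPULayers *int
}

// estimateGPULayers is a stand-in for the gguf-parser-backed estimate
// discussed in this thread: offloadable layers scaled by free/required VRAM.
// The hard-coded numbers are the ones from the example output above.
func estimateGPULayers(modelPath string, freeVRAM uint64) (int, error) {
	_ = modelPath // a real implementation would parse the GGUF file here
	const offloadLayers, requiredVRAM = 127, uint64(34165613568)
	return int(float64(offloadLayers) * float64(freeVRAM) / float64(requiredVRAM)), nil
}

// setDefaultGPULayers fills gpu_layers only when the user has not set it.
func setDefaultGPULayers(cfg *backendConfig, modelPath string, freeVRAM uint64) {
	if cfg.NGPULayers != nil {
		return // explicit user configuration wins
	}
	if layers, err := estimateGPULayers(modelPath, freeVRAM); err == nil {
		cfg.NGPULayers = &layers
	}
}

func main() {
	cfg := &backendConfig{}
	setDefaultGPULayers(cfg, "llama-3.1-405b-instruct.gguf", 10_000_000_000)
	fmt.Println("defaulted gpu_layers:", *cfg.NGPULayers) // 37 with 10 GB free
}
```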
