Enhance Model Import Flexibility: Support Backend & Quantization Selection Across Sources (HF, Ollama, Files, OCI)
Feature Request: Flexible Model Import with Backend and Quantization Selection
Currently, the model import workflow in LocalAI is somewhat rigid, especially when importing models from Hugging Face (HF), Ollama, local files, or OCI images. Users lack fine-grained control over:
- The choice of backend (e.g., `vLLM`, `transformers`, `llama.cpp`, `diffusers` for image generation)
- The specific quantization (e.g., Q4_K_M, Q5_K_S, GGUF variants)
- Automatic backend detection and template handling
Proposal
Enhance the model import system to support a flexible, user-driven workflow that allows:
- **Source Flexibility:**
  - Import models directly from Hugging Face (e.g., `HuggingFace: meta-llama/Llama-3-8b-instruct`)
  - Import from Ollama (e.g., `Ollama: llama3:instruct`)
  - Load from local files (e.g., `.gguf`, `.safetensors`)
  - Pull from OCI images (e.g., `oci://my-registry.com/my-model:latest`)
- **Backend and Quantization Selection:**
  - Allow users to explicitly choose the backend during import
  - Provide a list of available quantizations and backends for each model
  - Enable automatic detection of suitable backends based on file type (e.g., `.gguf` → `llama.cpp`); see the sketch after this list
- **Seamless Integration with Gallery:**
  - The model gallery should remain lightweight, focusing only on curated "latest and greatest" models
  - The import flow should handle complex or niche models, reducing the need to maintain every model in the gallery
- **Auto-Detection of Native Templates:**
  - When importing a `llama.cpp`-compatible model (e.g., `.gguf`), detect and use its native chat template from the upstream `llama.cpp` project
  - Fall back to an inline template definition if none is available (maintaining backward compatibility)
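To illustrate the auto-detection item above, here is a minimal sketch of mapping a model source to a candidate backend. The function name and extension table are assumptions for illustration only, not LocalAI's actual detection logic:

```python
from pathlib import Path

# Hypothetical extension-to-backend table; real detection would also inspect
# repository contents and model metadata before committing to a backend.
EXTENSION_TO_BACKEND = {
    ".gguf": "llama.cpp",
    ".safetensors": "transformers",  # could equally be vLLM, depending on user choice
}

def detect_backend(source: str, default: str = "transformers") -> str:
    """Guess a suitable backend from a model file name or path."""
    return EXTENSION_TO_BACKEND.get(Path(source).suffix.lower(), default)

print(detect_backend("llama-3-8b-instruct.Q4_K_M.gguf"))  # -> llama.cpp
```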
Benefits
- Greater flexibility for advanced users to select optimal backends and quantizations
- Reduced bloat in the model gallery—focus on quality, not quantity
- Improved user experience for importing models from diverse sources
- Better compatibility with upstream standards (e.g., `llama.cpp` templates)
Example Workflow
- User selects "Import from Hugging Face"
- Enters `meta-llama/Llama-3-8b-instruct`
- LocalAI lists available quantizations (Q4_K_M, Q5_K_S, etc.) and backends (vLLM, llama.cpp, transformers); see the sketch after this list
- User selects `llama.cpp` + `Q4_K_M` + auto-apply template
- LocalAI downloads the `.gguf` file, auto-applies the `llama-3` chat template from `llama.cpp`, and loads the model
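As a sketch of how steps 3 and 5 could be backed by the Hugging Face Hub, the snippet below lists the GGUF files in a repository (one per quantization) using the `huggingface_hub` client. The repository id is a placeholder, and a repo that actually publishes GGUF quantizations is assumed:

```python
from huggingface_hub import hf_hub_download, list_repo_files

def list_gguf_quantizations(repo_id: str) -> list[str]:
    """Return the GGUF files published in a repo, one per quantization level."""
    return [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]

# Placeholder repo id; any repository that ships GGUF quantizations works here.
files = list_gguf_quantizations("some-org/some-model-GGUF")
print(files)  # e.g. ['model.Q4_K_M.gguf', 'model.Q5_K_S.gguf', ...]

# Step 5 would then fetch the chosen quantization into the local cache:
# path = hf_hub_download("some-org/some-model-GGUF", "model.Q4_K_M.gguf")
```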
This feature would make LocalAI a truly universal model runner, supporting any model, any backend, any quantization—with minimal friction.
The new import flow is available on master:
https://github.com/user-attachments/assets/01f3ed3c-d6d3-48bb-b11e-384c4299c893
Currently works with:
- llama.cpp
- vLLM
- transformers
- MLX
- MLX-VLM
Requirement Description:
When working with model repositories, it is important to note that official model configuration files (`config.json`, `generation_config.json`, and `chat_template.jinja`) are not present in every repository. These configuration elements should therefore be imported separately, through a dedicated handling mechanism.
The rationale is that original/official repositories typically contain these configuration files, while fine-tuned variants often lack them. This creates a dependency-management challenge:
- Official repositories serve as the canonical source for model configurations
- Fine-tuned models frequently omit these files to reduce redundancy and repository size
- The configurations remain essential for proper model initialization and inference
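A minimal sketch of what importing configurations separately could look like, assuming the importer knows (or lets the user supply) a base repository to fall back to; the helper name and fallback policy are illustrative, not an existing LocalAI API:

```python
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

CONFIG_FILES = ["config.json", "generation_config.json", "chat_template.jinja"]

def fetch_config(repo_id: str, filename: str, base_repo_id: str | None = None) -> str | None:
    """Download one config file, falling back to the base repo when the fine-tune omits it."""
    for repo in (repo_id, base_repo_id):
        if repo is None:
            continue
        try:
            return hf_hub_download(repo, filename)  # returns a path in the local HF cache
        except EntryNotFoundError:
            continue
    return None  # caller decides how to degrade gracefully

# Hypothetical usage with placeholder repo ids:
# configs = {f: fetch_config("some-org/finetuned-model", f, "some-org/base-model")
#            for f in CONFIG_FILES}
```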
Best Practice Implementation: Treat configuration management as a distinct, modular component with the following characteristics:
- Separate configuration-fetching logic from model weight loading
- Implement fallback mechanisms for missing configurations
- Maintain a configuration cache to avoid redundant downloads
- Provide default configurations for common model architectures
- Enable configuration overrides for custom implementations
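Building on the sketch above, a hedged outline of how cache, defaults, and overrides could fit together; the resolution order, cache layout, and default table are assumptions rather than an existing implementation:

```python
import json
from pathlib import Path

# Hypothetical built-in defaults, used only when nothing else is available.
DEFAULT_GENERATION_CONFIG = {"llama": {"temperature": 0.7, "top_p": 0.9}}

def load_generation_config(repo_id: str, architecture: str,
                           cache_dir: Path = Path.home() / ".cache" / "model-configs",
                           override: dict | None = None) -> dict:
    """Resolve a generation config: explicit override > cached file > built-in default."""
    if override is not None:
        return override
    cached = cache_dir / repo_id.replace("/", "__") / "generation_config.json"
    if cached.exists():
        return json.loads(cached.read_text())
    # Downloading and populating the cache (e.g. via fetch_config above) would go here.
    return DEFAULT_GENERATION_CONFIG.get(architecture, {})
```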
This segregated approach ensures:
- Better error handling when configurations are absent
- Improved compatibility across original and fine-tuned models
- Cleaner separation of concerns in the model loading pipeline
- Enhanced maintainability through modular configuration handling
- Support for both official and community-distributed model variants
The implementation should degrade gracefully when configuration files are unavailable while retaining full functionality when they are present, supporting the diverse ecosystem of original and fine-tuned models consistently.