
Enhance Model Import Flexibility: Support Backend & Quantization Selection Across Sources (HF, Ollama, Files, OCI)

localai-bot opened this issue 2 months ago • 2 comments

Feature Request: Flexible Model Import with Backend and Quantization Selection

Currently, the model import workflow in LocalAI is somewhat rigid, especially when importing models from Hugging Face (HF), Ollama, local files, or OCI images. Users lack fine-grained control over:

  • The choice of backend (e.g., vLLM, transformers, llama.cpp, diffusers for image generation)
  • The specific quantization (e.g., Q4_K_M, Q5_K_S, GGUF variants)
  • Automatic backend detection and template handling

Proposal

Enhance the model import system to support a flexible, user-driven workflow that allows:

  1. Source Flexibility:

    • Import models directly from Hugging Face (e.g., HuggingFace: meta-llama/Llama-3-8b-instruct)
    • Import from Ollama (e.g., Ollama: llama3:instruct)
    • Load from local files (e.g., .gguf, .safetensors)
    • Pull from OCI images (e.g., oci://my-registry.com/my-model:latest)
  2. Backend and Quantization Selection:

    • Allow users to explicitly choose the backend during import
    • Provide a list of available quantizations and backends for each model
    • Enable automatic detection of suitable backends based on file type (e.g., .gguf → llama.cpp); a sketch of such detection follows this list
  3. Seamless Integration with Gallery:

    • The model gallery should remain lightweight, focusing only on curated "latest and greatest" models
    • The import flow should handle complex or niche models, reducing the need to maintain every model in the gallery
  4. Auto-Detection of Native Templates:

    • When importing a llama.cpp-compatible model (e.g., .gguf), detect and use its native chat template from the upstream llama.cpp project
    • Fallback to inline template definition if not available (maintaining backward compatibility)
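
For illustration, here is a minimal sketch of the extension-based backend detection mentioned in item 2, written in Go (LocalAI's implementation language). The function and backend identifiers are hypothetical, not existing LocalAI code:

```go
package importer

import (
	"path/filepath"
	"strings"
)

// Backend names are illustrative; the identifiers used by LocalAI may differ.
const (
	BackendLlamaCpp     = "llama.cpp"
	BackendTransformers = "transformers"
)

// DetectBackend guesses a suitable backend from the model file name alone.
// It returns the backend name and whether the guess is considered confident;
// the user would still be able to override the choice during import.
func DetectBackend(path string) (string, bool) {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".gguf":
		// GGUF files are consumed by llama.cpp-based backends.
		return BackendLlamaCpp, true
	case ".safetensors", ".bin", ".pt":
		// Plain weight files could belong to transformers, vLLM or diffusers;
		// the repository layout (config.json, model_index.json, ...) would be
		// needed to disambiguate, so the guess is marked low-confidence.
		return BackendTransformers, false
	default:
		return "", false
	}
}
```

A low-confidence result would simply preselect a backend in the import UI while keeping the full list of available backends open for the user to pick from.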

Benefits

  • Greater flexibility for advanced users to select optimal backends and quantizations
  • Reduced bloat in the model gallery—focus on quality, not quantity
  • Improved user experience for importing models from diverse sources
  • Better compatibility with upstream standards (e.g., llama.cpp templates)

Example Workflow

  1. User selects "Import from Hugging Face"
  2. Enters meta-llama/Llama-3-8b-instruct
  3. LocalAI lists available quantizations (Q4_K_M, Q5_K_S, etc.) and backends (vLLM, llama.cpp, transformers); a sketch of how such a listing could be derived follows this workflow
  4. User selects llama.cpp + Q4_K_M + auto-apply template
  5. LocalAI downloads the .gguf file, auto-applies the llama-3 chat template from llama.cpp, and loads the model
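
As a sketch of step 3, the available quantizations could be derived from the repository's file listing, assuming the repo hosts GGUF builds and using the Hugging Face Hub tree endpoint (`/api/models/<repo>/tree/main`). The function name and the quantization regex below are illustrative only:

```go
package importer

import (
	"encoding/json"
	"fmt"
	"net/http"
	"regexp"
	"strings"
)

// quantPattern matches common GGUF quantization tags such as Q4_K_M or Q5_K_S
// embedded in file names like "llama-3-8b-instruct.Q4_K_M.gguf".
var quantPattern = regexp.MustCompile(`(?i)(IQ|Q)\d[\w]*`)

// ListGGUFQuantizations queries the Hugging Face Hub file listing for a repo
// and returns the quantization tags found in its .gguf file names.
func ListGGUFQuantizations(repo string) ([]string, error) {
	url := fmt.Sprintf("https://huggingface.co/api/models/%s/tree/main", repo)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var entries []struct {
		Path string `json:"path"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&entries); err != nil {
		return nil, err
	}

	var quants []string
	for _, e := range entries {
		if !strings.HasSuffix(strings.ToLower(e.Path), ".gguf") {
			continue
		}
		if tag := quantPattern.FindString(e.Path); tag != "" {
			quants = append(quants, strings.ToUpper(tag))
		}
	}
	return quants, nil
}
```

An empty result would signal that the repository has no GGUF builds, in which case the import flow could offer the transformers or vLLM backends instead.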

This feature would make LocalAI a truly universal model runner, supporting any model, any backend, any quantization—with minimal friction.

localai-bot, Nov 05 '25 14:11

The new import flow is available on master:

https://github.com/user-attachments/assets/01f3ed3c-d6d3-48bb-b11e-384c4299c893

Currently works with:

  • llama.cpp
  • vLLM
  • transformers
  • MLX
  • MLX-VLM

mudler, Nov 18 '25 11:11

Requirement Description:

Official model configuration files (including config.json, generation_config.json, and chat_template.jinja) are not universally present across repositories, so these configuration elements should be imported separately through a dedicated handling mechanism.

The rationale is that original/official repositories typically contain these configuration files, while fine-tuned variants often lack them. This creates a dependency management challenge:

  • Official repositories serve as the canonical source for model configurations
  • Fine-tuned models frequently omit these files to reduce redundancy and repository size
  • Configuration requirements remain essential for proper model initialization and inference

Best Practice Implementation: Treat configuration management as a distinct, modular component with the following characteristics (a sketch follows this list):

  • Separate configuration fetching logic from model weight loading
  • Implement fallback mechanisms for missing configurations
  • Maintain a configuration cache to avoid redundant downloads
  • Provide default configurations for common model architectures
  • Enable configuration override capabilities for custom implementations
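
A minimal sketch of the fallback-plus-cache idea above, in Go: fetch a configuration file from the fine-tuned repository, fall back to its declared base repository, then to a bundled default, caching whatever is found. All names here are hypothetical, and the Hugging Face `resolve` URL scheme is assumed:

```go
package config

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// ErrNotFound is returned when a configuration file is absent in a repository.
var ErrNotFound = errors.New("configuration file not found")

// fetchFile downloads a single file from a Hugging Face repository, returning
// ErrNotFound on a 404 so callers can fall back to another source.
func fetchFile(repo, name string) ([]byte, error) {
	url := fmt.Sprintf("https://huggingface.co/%s/resolve/main/%s", repo, name)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotFound {
		return nil, ErrNotFound
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetching %s: %s", name, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// ResolveConfig looks for a configuration file (e.g. chat_template.jinja) in the
// fine-tuned repository first, then in its base repository, and finally falls
// back to a bundled default. Results are cached on disk to avoid re-downloading.
func ResolveConfig(cacheDir, repo, baseRepo, name string, defaults []byte) ([]byte, error) {
	cachePath := filepath.Join(cacheDir, repo, name)
	if data, err := os.ReadFile(cachePath); err == nil {
		return data, nil // cache hit
	}

	for _, source := range []string{repo, baseRepo} {
		if source == "" {
			continue
		}
		data, err := fetchFile(source, name)
		if errors.Is(err, ErrNotFound) {
			continue // try the next source in the fallback chain
		}
		if err != nil {
			return nil, err
		}
		_ = os.MkdirAll(filepath.Dir(cachePath), 0o755)
		_ = os.WriteFile(cachePath, data, 0o644)
		return data, nil
	}

	if defaults != nil {
		return defaults, nil // graceful degradation: architecture-level default
	}
	return nil, ErrNotFound
}
```

Keeping the fallback order explicit (fine-tune, then base repository, then default) keeps behavior predictable, and a user-supplied override could simply be prepended to that source list.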

This segregated approach ensures:

  • Better error handling when configurations are absent
  • Improved compatibility across original and fine-tuned models
  • Cleaner separation of concerns in the model loading pipeline
  • Enhanced maintainability through modular configuration handling
  • Support for both official and community-distributed model variants

The implementation should degrade gracefully when configuration files are unavailable while retaining full functionality when they are present, supporting the diverse ecosystem of original and fine-tuned models consistently.

griptapi, Nov 19 '25 13:11