
Support for dedicated vision model

wojons opened this issue 4 months ago · 0 comments


Title

Support calling a separate vision model for image-to-text preprocessing

Summary

Many of the affordable, lightweight models usable with agent-zero are text-only. When a request includes an image, it fails outright because the selected model does not support vision input: the OpenRouter endpoint returns a 404 Not Found, indicating that no vision-capable endpoint exists for that model.

Problem

When using litellm with agent-zero, any call that includes an image fails if the primary model does not support vision. Example traceback:

litellm.exceptions.NotFoundError: OpenrouterException - {"error":{"message":"No endpoints found that support image input","code":404}}

This limits the use of cheaper or smaller models for tasks that sometimes require image understanding, because the entire request fails rather than allowing a fallback.
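For reference, a minimal reproduction of the failure looks roughly like this (the model name is illustrative; any text-only OpenRouter model should trigger the same error):

```python
# Minimal repro sketch: sending image content to a text-only model
# via litellm raises NotFoundError from OpenRouter.
import litellm

try:
    litellm.completion(
        model="openrouter/mistralai/mistral-7b-instruct",  # text-only model (example)
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }],
    )
except litellm.exceptions.NotFoundError as e:
    print(e)  # OpenrouterException - "No endpoints found that support image input"
```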

Proposed Feature

  • Add an option to specify a secondary vision model in the configuration.
  • When an image is provided and the primary model does not support vision, agent-zero should:
    1. Send the image to the vision-capable model.
    2. Get back text output (e.g., description, OCR, caption).
    3. Pass the text to the primary chat model.
  • Ideally, make this behavior opt-in via a config option or request parameter (a rough sketch of the flow follows this list).
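
A minimal sketch of the proposed fallback, assuming litellm is used for both calls. The model names and helper functions below are hypothetical illustrations, not existing agent-zero API; the capability check uses litellm's `supports_vision()` helper, which may return False for models missing from its cost map:

```python
# Sketch: route image inputs through a secondary vision model when the
# primary model is text-only, then pass the resulting text to the primary.
import litellm

PRIMARY_MODEL = "openrouter/mistralai/mistral-7b-instruct"  # text-only (example)
VISION_MODEL = "openrouter/openai/gpt-4o-mini"              # vision-capable (example)

def describe_image(image_url: str) -> str:
    """Ask the vision-capable model for a text description of the image."""
    resp = litellm.completion(
        model=VISION_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail, including any text it contains."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

def chat(prompt: str, image_url: str | None = None) -> str:
    """Send the request to the primary model, converting the image to text
    first if the primary model cannot accept vision input."""
    if image_url and not litellm.supports_vision(model=PRIMARY_MODEL):
        # Fallback path: image -> text via the vision model, then text-only chat.
        prompt = f"{prompt}\n\n[Image description]: {describe_image(image_url)}"
        image_url = None
    content = prompt if image_url is None else [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": image_url}},
    ]
    resp = litellm.completion(
        model=PRIMARY_MODEL,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```

The same pattern would work with any image-to-text operation (captioning, OCR) in step 2; the key design point is that the primary request no longer hard-fails when an image is present.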

Benefits

  • Lets users run more affordable text-only models for most tasks while still handling occasional vision inputs.
  • Prevents failures when image inputs are accidentally or occasionally provided.
  • Expands agent-zero’s flexibility for multi-modal workflows without requiring a single expensive model.

Environment / Versions

  • Python 3.12
  • litellm
  • agent-zero
  • Example endpoint: https://openrouter.ai/api/v1/chat/completions

wojons · Aug 25 '25 01:08