Support for dedicated vision model
## Title
Support calling a separate vision model for image-to-text preprocessing
## Summary

Many of the affordable or lightweight models supported by agent-zero are text-only. When an image is included in a request, the call fails because those models do not accept vision input: the OpenRouter endpoint returns a 404 Not Found, indicating that no model endpoint supporting image input was found.
## Problem

When using litellm with agent-zero to process images, calls fail if the primary model does not support vision. Example error:

`litellm.exceptions.NotFoundError: OpenrouterException - {"error":{"message":"No endpoints found that support image input","code":404}}`
This limits the use of cheaper or smaller models for tasks that sometimes require image understanding, because the entire request fails rather than allowing a fallback.
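For reference, a minimal reproduction at the litellm layer might look like the sketch below. The model name and image URL are placeholders, not the exact values agent-zero sends.

```python
import litellm

# Illustrative repro: an OpenAI-style image message sent to a text-only
# OpenRouter model. The model name and image URL are placeholders.
try:
    litellm.completion(
        model="openrouter/mistralai/mistral-7b-instruct",  # example text-only model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this image show?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
            ],
        }],
    )
except litellm.exceptions.NotFoundError as err:
    # OpenRouter reports that no endpoint for this model accepts image input.
    print(err)
```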
## Proposed Feature

- Add an option to specify a secondary vision model in the configuration.
- When an image is provided and the primary model does not support vision, agent-zero should:
  - Send the image to the vision-capable model.
  - Get back text output (e.g., description, OCR, caption).
  - Pass that text to the primary chat model (see the sketch below).
- Ideally, make this behavior opt-in via a config option or parameter.
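A minimal sketch of how this fallback could work at the litellm layer, under a few assumptions: the model names, configuration constants, and the `describe_image` helper are illustrative, and `litellm.supports_vision` is used here as the capability check.

```python
import litellm

# Hypothetical configuration: the primary (possibly text-only) chat model and a
# separate vision-capable model used only for image-to-text preprocessing.
CHAT_MODEL = "openrouter/mistralai/mistral-7b-instruct"
VISION_MODEL = "openrouter/openai/gpt-4o-mini"


def describe_image(image_url: str) -> str:
    """Send the image to the vision model and return a plain-text description."""
    response = litellm.completion(
        model=VISION_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail, including any visible text."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content


def chat_with_optional_image(prompt: str, image_url: str | None = None) -> str:
    """Route image input through the vision model when the chat model is text-only."""
    if image_url and not litellm.supports_vision(model=CHAT_MODEL):
        # Fallback: convert the image to text and prepend it to the prompt.
        prompt = f"[Image description]: {describe_image(image_url)}\n\n{prompt}"
        image_url = None

    content = [{"type": "text", "text": prompt}]
    if image_url:
        # The chat model supports vision, so pass the image through directly.
        content.append({"type": "image_url", "image_url": {"url": image_url}})

    response = litellm.completion(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```

In agent-zero itself this would presumably map to an extra model slot in the settings (e.g., a vision model entry alongside the existing chat and utility models), with the fallback only active when that slot is configured.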
## Benefits

- Lets users run more affordable text-only models for most tasks while still handling occasional vision inputs.
- Prevents failures when image inputs are accidentally or occasionally provided.
- Expands agent-zero’s flexibility for multi-modal workflows without requiring a single expensive model.
## Environment / Versions
- Python 3.12
- litellm
- agent-zero
- Example endpoint: https://openrouter.ai/api/v1/chat/completions