Support for dedicated vision model
## Title
Support calling a separate vision model for image-to-text preprocessing
## Summary

Many of the affordable or lightweight models supported by agent-zero are text-only. When an image is included in a request, the call fails because those models do not accept vision input: the OpenRouter endpoint returns a 404 Not Found, indicating that no model endpoint supporting image input was found.
## Problem

When using litellm with agent-zero to process images, calls fail if the primary model does not support vision. Example error:

`litellm.exceptions.NotFoundError: OpenrouterException - {"error":{"message":"No endpoints found that support image input","code":404}}`
This limits the use of cheaper or smaller models for tasks that sometimes require image understanding, because the entire request fails rather than allowing a fallback.
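For reference, a minimal reproduction at the litellm layer might look like the sketch below. The model name and image URL are placeholders, not the exact values agent-zero sends.

```python
import litellm

# Illustrative repro: an OpenAI-style image message sent to a text-only
# OpenRouter model. The model name and image URL are placeholders.
try:
    litellm.completion(
        model="openrouter/mistralai/mistral-7b-instruct",  # example text-only model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this image show?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
            ],
        }],
    )
except litellm.exceptions.NotFoundError as err:
    # OpenRouter reports that no endpoint for this model accepts image input.
    print(err)
```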
## Proposed Feature

- Add an option to specify a secondary vision model in the configuration.
- When an image is provided and the primary model does not support vision, agent-zero should:
  - Send the image to the vision-capable model.
  - Get back text output (e.g., description, OCR, caption).
  - Pass that text to the primary chat model (see the sketch below).
- Ideally, make this behavior opt-in via a config option or parameter.
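A minimal sketch of how this fallback could work at the litellm layer, under a few assumptions: the model names, configuration constants, and the `describe_image` helper are illustrative, and `litellm.supports_vision` is used here as the capability check.

```python
import litellm

# Hypothetical configuration: the primary (possibly text-only) chat model and a
# separate vision-capable model used only for image-to-text preprocessing.
CHAT_MODEL = "openrouter/mistralai/mistral-7b-instruct"
VISION_MODEL = "openrouter/openai/gpt-4o-mini"


def describe_image(image_url: str) -> str:
    """Send the image to the vision model and return a plain-text description."""
    response = litellm.completion(
        model=VISION_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail, including any visible text."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content


def chat_with_optional_image(prompt: str, image_url: str | None = None) -> str:
    """Route image input through the vision model when the chat model is text-only."""
    if image_url and not litellm.supports_vision(model=CHAT_MODEL):
        # Fallback: convert the image to text and prepend it to the prompt.
        prompt = f"[Image description]: {describe_image(image_url)}\n\n{prompt}"
        image_url = None

    content = [{"type": "text", "text": prompt}]
    if image_url:
        # The chat model supports vision, so pass the image through directly.
        content.append({"type": "image_url", "image_url": {"url": image_url}})

    response = litellm.completion(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```

In agent-zero itself this would presumably map to an extra model slot in the settings (e.g., a vision model entry alongside the existing chat and utility models), with the fallback only active when that slot is configured.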
## Benefits

- Lets users run more affordable text-only models for most tasks while still handling occasional vision inputs.
- Prevents failures when image inputs are accidentally or occasionally provided.
- Expands agent-zero’s flexibility for multi-modal workflows without requiring a single expensive model.
## Environment / Versions
- Python 3.12
- litellm
- agent-zero
- Example endpoint: https://openrouter.ai/api/v1/chat/completions