feat: Add image attachment support, generator updates, and related changes (#1002)
Motivation
EXO was previously limited to text-only Large Language Models (LLMs). This change expands EXO's capabilities into the multimodal space, specifically allowing users to interact with Vision-Language Models (VLMs). It adds the ability to process image inputs, which is essential for modern AI applications such as image description, visual reasoning, and multimodal chat.
#1002
Changes
- Dependency Management: Integrated mlx-vlm as a core dependency and added torch and torchvision to pyproject.toml to support the image preprocessing required by transformer-based VLMs.
- Vision Logic: Created src/exo/worker/engines/mlx/generator/generate_vlm.py which acts as the dedicated router for multimodal requests. It includes logic to extract Base64 encoded images from various message formats (dicts, Pydantic objects, strings).
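As a rough illustration of the extraction logic described above, the sketch below shows how Base64 images can be pulled out of the different message shapes (dicts, Pydantic objects, plain strings). The function name matches the one mentioned later in this PR, but the body is a hypothetical reconstruction, not the committed code:

```python
# Hedged sketch of extract_images_from_messages; message shapes are
# assumptions based on the OpenAI-style format described in this PR.
import base64

def extract_images_from_messages(messages):
    """Collect Base64 image payloads from OpenAI-style chat messages."""
    images = []
    for message in messages:
        # Pydantic objects expose .content; plain dicts use ["content"].
        content = getattr(message, "content", None)
        if content is None and isinstance(message, dict):
            content = message.get("content")
        if isinstance(content, str):
            continue  # text-only message, nothing to extract
        for block in content or []:
            if isinstance(block, dict) and block.get("type") == "image_url":
                url = block["image_url"]["url"]
                # Strip a data-URI prefix like "data:image/png;base64,"
                if url.startswith("data:"):
                    url = url.split(",", 1)[1]
                images.append(url)
    return images

msgs = [
    {"role": "user", "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {
            "url": "data:image/png;base64,"
                   + base64.b64encode(b"\x89PNG").decode()}},
    ]},
]
print(len(extract_images_from_messages(msgs)))  # 1
```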
Core Engine Updates:
- Updated utils_mlx.py to correctly load vision models using mlx_vlm.load() instead of standard mlx_lm to prevent hanging.
- Modified model detection to identify "Vision Processors" by checking for the absence of eos_token_id at the top level of the config object.
Template System:
- Enhanced apply_chat_template to support OpenAI-style multimodal content structures, allowing a mix of text and image_url blocks in a single message.
Model Registry:
- Registered mlx-community/Qwen2-VL-2B-Instruct-4bit and mlx-community/Qwen3-VL-2B-Instruct-4bit in model_cards.py
- Added supports_vision=True metadata flag.
- Added a special function in model_meta after finding the root cause of the error: multimodal models such as SmolVLM2 and Qwen3-VL nest their architectural settings (e.g. layer count and hidden size) inside a text_config property in config.json (this can be verified on Hugging Face), while EXO was only reading the top level.
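The config-normalisation fix described above can be sketched as a small merge step: architectural fields found under text_config are lifted to the top level. The function name is hypothetical and the nesting mirrors configs like Qwen3-VL's config.json:

```python
# Hedged sketch of the model_meta fix: fall back to text_config for
# architectural settings that VLM configs nest one level down.
def resolve_model_config(config: dict) -> dict:
    text_cfg = config.get("text_config", {})
    # Top-level keys win; nested text_config fills in the gaps.
    top = {k: v for k, v in config.items() if k != "text_config"}
    return {**text_cfg, **top}

vlm = {"model_type": "qwen3_vl",
       "text_config": {"num_hidden_layers": 28, "hidden_size": 2048}}
resolved = resolve_model_config(vlm)
print(resolved["num_hidden_layers"])  # 28
```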
Dashboard:
- Updated the Svelte-based dashboard frontend (app.svelte.ts) to correctly serialize image uploads into the multimodal payload structure required by the updated backend.
Why It Works
The approach works by implementing a "Vision-Aware" routing layer. Instead of trying to force VLMs into the standard LLM generation pipeline (which crashes because VLMs expect different input tensors and processor handling), the system now detects the model type at load time. By identifying the model as a "Vision Processor," EXO knows to use the mlx-vlm generation logic. The implementation of extract_images_from_messages ensures that even if a model is vision-capable but the current query is text-only, the system routes it through the correct processor bridge, preventing the "'LanguageModelOutput' object is not subscriptable" error typically seen when standard pipelines interact with VLM heads.
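The routing decision described above can be condensed into a toy dispatcher. The generate_* names are stand-ins for EXO's actual entry points, and the detection condition is the one this PR describes (no top-level eos_token_id means vision processor):

```python
# Illustrative sketch of the "vision-aware" routing layer; function
# names are placeholders, not EXO's real generation entry points.
def route_generation(config: dict) -> str:
    if "eos_token_id" not in config:
        # Vision processor: always use the mlx-vlm path, even for
        # text-only queries, so the processor bridge stays consistent.
        return "generate_vlm"
    return "generate_llm"

print(route_generation({"text_config": {"eos_token_id": 151645}}))  # generate_vlm
print(route_generation({"eos_token_id": 2}))                        # generate_llm
```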
Test Plan
Manual Testing
Hardware: MacBook Air M3, 16 GB
API Functional Verification:
- Sent a POST request containing a 100x100 red-square Base64 image to the Qwen2-VL model. Verified that the model correctly identifies the visual content.
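For reference, the manual check above can be reproduced by building an OpenAI-style chat payload carrying a Base64 image. The placeholder bytes below stand in for a real 100x100 red-square PNG, and the endpoint in the comment is an assumption:

```python
# Sketch of the manual API check: an OpenAI-style multimodal payload.
# The image bytes are a placeholder, not a valid PNG.
import base64
import json

red_square_b64 = base64.b64encode(b"<100x100 red-square PNG bytes>").decode()
payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What color is this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{red_square_b64}"}},
        ],
    }],
}
body = json.dumps(payload)
# e.g. POST this body to the local chat-completions endpoint.
```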
Dashboard End-to-End:
- Uploaded image file via the dashboard UI, added a text prompt "What is in this image?", and confirmed the model's response appeared correctly in the chat window.
- Follow-up Chat: Verified that after an image is processed, the model can still handle subsequent text-only follow-ups in the same session without crashing.
- Launched an instance with two VLMs (Qwen2 and Qwen3) and tested conversations by switching between the models in the same chat.
- Tested with several images in .jpeg and .png formats
Automated Testing
Warmup Logic Testing:
- Verified that the startup/warmup sequence no longer triggers AttributeError by running uv run exo with a pre-loaded vision model.
- The fix ensures the system can extract the tokenizer from the vision processor during initial health checks.
Multimodal Extraction Tests:
- Used check_multimodal.py (integration script) to ensure the image extraction logic is robust against different input schemas (OpenAI format vs. simplified dashboard format). (not committed)
Note: The uv.lock dependency updates need a second look.