Models should not need to be re-loaded between back-to-back prompts
When using mlx-vlm through the Python API, we need to call `mlx_vlm.utils.load` before every request to `stream_generate`, because calling `stream_generate` without reloading the model raises exceptions. We see this across multiple VLM architectures when the model is not re-loaded. The exceptions differ between architectures, but there is usually some state that is not being reset.
Hey @neilmehta24
Thanks for reporting this!
Could you share a reproducible example?
> The exceptions are different from one another between architectures, there is usually some state that is not being reset
I regularly use `stream_generate` in the way you describe during development.
So I suspect the KV cache or the growing input size (i.e., the accumulating number of images). The former is easy to fix; the latter has limitations, because not all models support multiple images and/or multi-turn conversation.
My LLM made this suggestion after digging around. It's working for me, and I can now run multiple queries without reloading the model:
MLX-VLM Multi-Query Fix
Issue: MLX-VLM fails after the first `generate()` call with the error `ValueError: input operand has more dimensions than allowed by the axis remapping`
Root Cause: In mlx_vlm/models/llava_next/llava_next.py, the get_input_embeddings method modifies self.image_newline in-place:
Lines 107-108 modify the array from 1-D to 3-D:

```python
self.image_newline = np.array(self.image_newline)[None, None, :]
self.image_newline = np.broadcast_to(self.image_newline, image_features.shape)
```
On subsequent calls, self.image_newline is already 3D, causing dimension mismatch errors.
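The shape error can be reproduced in plain NumPy, independent of mlx-vlm (the hidden size and feature shape below are arbitrary stand-ins):

```python
import numpy as np

# Stand-ins: a 1-D image_newline vector and an image_features shape
image_newline = np.zeros(4)
target_shape = (2, 3, 4)

# First call succeeds: 1-D -> (1, 1, 4) -> broadcast to (2, 3, 4)
out = np.broadcast_to(image_newline[None, None, :], target_shape)
assert out.shape == target_shape

# The buggy code stores the broadcast result back on the model,
# so on the next call image_newline is already 3-D ...
image_newline = out

# ... and [None, None, :] then yields a 5-D array, which NumPy
# refuses to broadcast down to the 3-D target shape
try:
    np.broadcast_to(image_newline[None, None, :], target_shape)
    raised = False
except ValueError:
    raised = True
assert raised
```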
Fix: Save the original 1D state and restore it before each use:
```python
def fixed_get_input_embeddings(self, input_ids, pixel_values):
    # Save the original 1-D array on the first call
    if hasattr(self, 'image_newline') and not hasattr(self, '_original_image_newline'):
        self._original_image_newline = np.array(self.image_newline).copy()

    # Restore the original state before each use
    if hasattr(self, '_original_image_newline'):
        self.image_newline = self._original_image_newline.copy()

    # Continue with the original method
    return original_get_input_embeddings(self, input_ids, pixel_values)
```
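To see the save-and-restore pattern end to end without loading a real model, here is a self-contained sketch. `DummyModel` is a stand-in for the real class in `mlx_vlm/models/llava_next/llava_next.py`, with a simplified method signature; only the mutation pattern is reproduced:

```python
import numpy as np

class DummyModel:
    # Stand-in for the LLaVA-Next model class (assumption, not the real class)
    def __init__(self, hidden_size=4):
        self.image_newline = np.zeros(hidden_size)

    def get_input_embeddings(self, image_features):
        # Reproduces the buggy in-place mutation from the issue
        self.image_newline = np.array(self.image_newline)[None, None, :]
        self.image_newline = np.broadcast_to(self.image_newline, image_features.shape)
        return image_features

# Keep a reference to the unpatched method
original_get_input_embeddings = DummyModel.get_input_embeddings

def fixed_get_input_embeddings(self, image_features):
    # Save the original 1-D array on the first call
    if hasattr(self, 'image_newline') and not hasattr(self, '_original_image_newline'):
        self._original_image_newline = np.array(self.image_newline).copy()
    # Restore it before each use
    if hasattr(self, '_original_image_newline'):
        self.image_newline = self._original_image_newline.copy()
    return original_get_input_embeddings(self, image_features)

# Monkeypatch the class
DummyModel.get_input_embeddings = fixed_get_input_embeddings

model = DummyModel()
features = np.zeros((2, 3, 4))
model.get_input_embeddings(features)
model.get_input_embeddings(features)  # second call no longer raises
```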
Result: Multiple generate() calls now work without model reloading, improving performance by ~10x for batch processing.
Suggested permanent fix: Instead of modifying self.image_newline in-place, create a local variable:
```python
# Better approach: don't modify self.image_newline
image_newline_expanded = np.array(self.image_newline)[None, None, :]
image_newline_broadcasted = np.broadcast_to(image_newline_expanded, image_features.shape)
```
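As a quick stand-alone sanity check (plain NumPy, arbitrary shapes): because the expansion happens in local variables, the stored 1-D vector is never reassigned, so repeated calls keep working:

```python
import numpy as np

def expand_newline(image_newline, image_features):
    # Expand into locals; the stored 1-D vector is never reassigned
    expanded = np.array(image_newline)[None, None, :]
    return np.broadcast_to(expanded, image_features.shape)

newline = np.zeros(4)
features = np.zeros((2, 3, 4))
expand_newline(newline, features)
out = expand_newline(newline, features)  # second call succeeds
assert out.shape == features.shape
assert newline.ndim == 1  # stored state unchanged
```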