Voxtral not working with mlx-lm v0.28.x
Summary
The latest mlx-lm (v0.28.1) introduces a regression in generate_step that breaks models using input_embeddings, specifically affecting the Voxtral STT pipeline in mlx-audio.
The Bug
In the prefill loop, the number of tokens to process is computed from prompt.size alone:
n_to_process = min(prefill_step_size, prompt.size - 1)
Problem: When input_embeddings is provided (common in audio/vision models), this calculation ignores the embeddings' length. If the prompt is empty or small:
- prompt.size - 1 becomes negative or zero
- The loop processes 0 embeddings despite having valid input
- The model receives an empty array, causing a reshape error
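The failure mode can be sketched in isolation (function names and the 512 default are illustrative, not mlx-lm's actual API; the real change is in the linked PR):

```python
# Minimal sketch of the prefill-length calculation described above.
# Names and the prefill_step_size default are illustrative, not mlx-lm's API.

def prefill_len_buggy(prompt_size: int, prefill_step_size: int = 512) -> int:
    # Current behavior: only the token prompt is considered.
    return min(prefill_step_size, prompt_size - 1)

def prefill_len_fixed(prompt_size: int, embeddings_len: int,
                      prefill_step_size: int = 512) -> int:
    # One possible fix: also account for the embeddings' sequence length.
    total = max(prompt_size, embeddings_len)
    return min(prefill_step_size, total - 1)

# Empty token prompt but 382 audio embeddings, as in the Voxtral trace:
print(prefill_len_buggy(0))       # -1: nothing is prefilled
print(prefill_len_fixed(0, 382))  # 381: the embeddings are processed
```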
Reproduction
python -m mlx_audio.stt.generate --model mlx-community/Voxtral-Mini-3B-2507-bf16 --audio /Users/prince_canuma/Downloads/conversational_b.wav --output output --verbose
Output showing the issue:
Transcription:
inputs shape: (1, 0) # Empty prompt (expected)
input_embeddings shape: (1, 382, 3072) # Valid embeddings from audio
inputs shape: (1, 0)
input_embeddings shape: (1, 0, 3072) # ❌ Embeddings incorrectly truncated to 0!
Traceback (most recent call last):
...
File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_lm/models/llama.py", line 83, in __call__
queries = queries.reshape(B, L, self.n_heads, -1).transpose(0, 2, 1, 3)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: [reshape] Cannot infer the shape of an empty array
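The crash can be reproduced at the shape level without loading a model (numpy stands in for mx.array here, and n_heads=24 is illustrative rather than Voxtral's real config; mlx raises the analogous error shown in the traceback):

```python
import numpy as np

# queries.reshape(B, L, n_heads, -1) cannot infer the -1 dimension
# when L == 0 makes the array empty, matching the traceback above.
B, L, dims = 1, 0, 3072
queries = np.zeros((B, L, dims))

try:
    queries.reshape(B, L, 24, -1)
except ValueError as e:
    print("reshape failed:", e)
```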
Impact
This breaks any model that passes input_embeddings alongside an empty or short token prompt, including:
- Voxtral (speech-to-text)
- Vision-language models
- Any multimodal model using embeddings directly
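Until the fix is released, affected callers could guard at startup (a sketch assuming the regression is confined to the 0.28.x series; both helpers are hypothetical, not part of mlx-lm):

```python
from importlib.metadata import PackageNotFoundError, version

def is_affected_version(v: str) -> bool:
    # Assumption: only the 0.28.x series carries the regression.
    major, minor = (int(p) for p in v.split(".")[:2])
    return (major, minor) == (0, 28)

def mlx_lm_is_affected() -> bool:
    # Hypothetical guard: check the installed mlx-lm before transcribing.
    try:
        return is_affected_version(version("mlx-lm"))
    except PackageNotFoundError:
        return False
```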
Proposed Fix
https://github.com/ml-explore/mlx-lm/pull/606