Voxtral not working with mlx-lm v0.28.x

Open Blaizzy opened this issue 3 months ago • 0 comments

Summary

The latest mlx-lm (v0.28.1) introduces a regression in generate_step that breaks models using input_embeddings, specifically affecting the Voxtral STT pipeline in mlx-audio.

The Bug

In the prefill loop, the number of tokens to process is computed from prompt.size alone:

n_to_process = min(prefill_step_size, prompt.size - 1)

Problem: When input_embeddings is provided (common in audio/vision models), this calculation ignores the embeddings' length. If the prompt is empty or small:

  • prompt.size - 1 becomes negative or zero (-1 for an empty prompt, 0 for a single-token prompt)
  • The loop processes 0 embeddings despite having valid input
  • The model receives an empty array, causing a reshape error
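
To make the arithmetic concrete, here is a minimal sketch that reproduces only the length calculation, without mlx. The helper name tokens_to_process_buggy is hypothetical (it is not an mlx-lm function), and the step size of 2048 is an assumed value for illustration; the embedding length of 382 is taken from the log in the Reproduction section.

```python
# Toy reproduction of the length calculation only; no mlx required.
# `tokens_to_process_buggy` is a hypothetical helper, not part of mlx-lm.

def tokens_to_process_buggy(prefill_step_size: int, prompt_size: int) -> int:
    """Mirror the expression quoted above: only the token prompt's size is used."""
    return min(prefill_step_size, prompt_size - 1)


PREFILL_STEP_SIZE = 2048  # assumed value for illustration
EMBEDDING_LEN = 382       # sequence length of the audio embeddings in the log

# The result never depends on EMBEDDING_LEN, which is the core of the problem.
print(tokens_to_process_buggy(PREFILL_STEP_SIZE, 0))  # -1 (empty prompt)
print(tokens_to_process_buggy(PREFILL_STEP_SIZE, 1))  # 0  (single-token prompt)
```

Whatever the embeddings contain, the value handed to the prefill step is derived from a prompt of size 0 or 1, so it can never cover the 382 positions that actually need processing.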

Reproduction

python -m mlx_audio.stt.generate --model mlx-community/Voxtral-Mini-3B-2507-bf16 --audio /Users/prince_canuma/Downloads/conversational_b.wav --output output --verbose

Output showing the issue:

Transcription:
inputs shape: (1, 0)                    # Empty prompt (expected)
input_embeddings shape: (1, 382, 3072)  # Valid embeddings from audio
inputs shape: (1, 0)
input_embeddings shape: (1, 0, 3072)    # ❌ Embeddings incorrectly truncated to 0!

Traceback (most recent call last):
  ...
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_lm/models/llama.py", line 83, in __call__
    queries = queries.reshape(B, L, self.n_heads, -1).transpose(0, 2, 1, 3)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: [reshape] Cannot infer the shape of an empty array

Impact

This breaks any model that uses input_embeddings with empty/small prompts, including:

  • Voxtral (speech-to-text)
  • Vision-language models
  • Any multimodal model using embeddings directly

Proposed Fix

https://github.com/ml-explore/mlx-lm/pull/606
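
For readers who want the idea without opening the PR, below is a rough sketch of a length-aware calculation. It is not the diff from the PR. The helper prefill_chunks and its parameters are hypothetical, and it assumes the prefill should cover every position except the last one, which is left for the first generation step.

```python
from typing import Iterator, Optional

# Hypothetical helper for illustration; not the code from the linked PR.


def prefill_chunks(
    prefill_step_size: int,
    prompt_size: int,
    embeddings_len: Optional[int] = None,
) -> Iterator[int]:
    """Yield prefill chunk sizes, using the embeddings' length when present."""
    # When embeddings are supplied, they define the sequence length.
    total = embeddings_len if embeddings_len is not None else prompt_size
    # Assumption: the final position is left for the first generation step.
    remaining = total - 1
    while remaining > 0:
        n = min(prefill_step_size, remaining)
        yield n
        remaining -= n


print(list(prefill_chunks(2048, 0, 382)))  # [381]
print(list(prefill_chunks(128, 0, 382)))   # [128, 128, 125]
print(list(prefill_chunks(2048, 5)))       # [4]
```

With an empty token prompt and 382 embedding positions, the chunk sizes are now driven by the embeddings, and the loop simply does nothing when there is at most one position to process.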

Blaizzy, Nov 12 '25 11:11