Video2Text inference is slow and has high VRAM consumption
Hi,
I want to process a 90-second video, but I run out of memory. Is there any way to reduce the VRAM consumption? Thanks.
python -m mlx_vlm.video_generate --model mlx-community/Qwen2-VL-7B-Instruct-bf16 --max-tokens 500 --prompt "Describe this video" --video /Users/mdsadmin/demos/Excavator.mp4 --max-pixels 720 410 --fps 1.0
Loading model: mlx-community/Qwen2-VL-7B-Instruct-bf16
Fetching 14 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 44183.79it/s]
==========
Video: /Users/mdsadmin/demos/Excavator.mp4
Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|video_pad|><|vision_end|>Describe this video<|im_end|>
<|im_start|>assistant
qwen-vl-utils using torchvision to read video.
Generating video description...
libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 190794240000 bytes which is greater than the maximum allowed buffer size of 77309411328 bytes.
Could you share the specs of your machine?
I would recommend:
- Trying 8bit or 4bit quants.
- Trying the 2B version.
- Or lowering the resolution further to 512 or 224 (example command below).
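For example, something along these lines should fit (the exact 4-bit repo name and the lower --max-pixels values are assumptions on my part; substitute whatever quant/resolution works for you):

# assumed 4-bit quant of the same model, with --max-pixels reduced from 720 410 to 512 288
python -m mlx_vlm.video_generate --model mlx-community/Qwen2-VL-7B-Instruct-4bit --max-tokens 500 --prompt "Describe this video" --video /Users/mdsadmin/demos/Excavator.mp4 --max-pixels 512 288 --fps 1.0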
Hi Prince,
My test machine is an M3 Max with 128 GB of RAM.
Thanks, Nan
Ok, thanks.
Awesome!
It should work fine if you just lower the resolution.
I have an M3 Max with 96 GB of unified RAM.
I can run this example in under a minute: https://github.com/Blaizzy/mlx-vlm/blob/62bb0ee2f57354de4cd27e42be593049269353a4/examples/video_generation.ipynb
Ok, thanks.
My pleasure!
Closing as stale.