Video2Text inference is slow and has high VRAM consumption
Hi,
I want to process a 90-second video, but I run out of memory. Is there any way to reduce the VRAM consumption? Thanks.
python -m mlx_vlm.video_generate --model mlx-community/Qwen2-VL-7B-Instruct-bf16 --max-tokens 500 --prompt "Describe this video" --video /Users/mdsadmin/demos/Excavator.mp4 --max-pixels 720 410 --fps 1.0
Loading model: mlx-community/Qwen2-VL-7B-Instruct-bf16
Fetching 14 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 44183.79it/s]
==========
Video: /Users/mdsadmin/demos/Excavator.mp4
Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|video_pad|><|vision_end|>Describe this video<|im_end|>
<|im_start|>assistant
qwen-vl-utils using torchvision to read video.
Generating video description...
libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 190794240000 bytes which is greater than the maximum allowed buffer size of 77309411328 bytes.
Could you share the specs of your machine?
I would recommend:
- Trying 8bit or 4bit quants.
- Trying the 2B version.
- Or lowering the resolution further to 512 or 224 (example command below).
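For example, something along these lines should fit (the exact 4-bit repo name and the lower --max-pixels values are assumptions on my part; substitute whatever quant/resolution works for you):

# assumed 4-bit quant of the same model, with --max-pixels reduced from 720 410 to 512 288
python -m mlx_vlm.video_generate --model mlx-community/Qwen2-VL-7B-Instruct-4bit --max-tokens 500 --prompt "Describe this video" --video /Users/mdsadmin/demos/Excavator.mp4 --max-pixels 512 288 --fps 1.0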
Hi Prince,
My test machine is an M3 Max with 128 GB of RAM.
Thanks, Nan
Ok, thanks.
Awesome!
It should work fine if you just lower the resolution.
I have an M3 Max with 96 GB of unified RAM.
I can run this example in under a minute: https://github.com/Blaizzy/mlx-vlm/blob/62bb0ee2f57354de4cd27e42be593049269353a4/examples/video_generation.ipynb
Ok, thanks.
My pleasure!
Closing as stale.