
Improvements: Papers

Blaizzy opened this issue 1 year ago · 0 comments

**VisionZip** — a simple yet effective method that selects a set of informative tokens as input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance.

https://arxiv.org/pdf/2412.04467
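The core idea — keep only the most informative visual tokens before they reach the language model — can be sketched in a few lines. This is a minimal illustration, not the paper's actual algorithm: it assumes a per-token importance score is available (e.g. attention weights from the vision encoder's [CLS] token) and simply keeps the top-k tokens in their original order.

```python
import numpy as np

def select_informative_tokens(tokens: np.ndarray, scores: np.ndarray, k: int) -> np.ndarray:
    """Keep the k visual tokens with the highest importance scores.

    tokens: (N, D) visual token embeddings from the vision encoder
    scores: (N,) per-token importance, e.g. [CLS] attention weights (assumed given)
    k:      number of tokens to pass to the language model
    """
    # Indices of the k highest-scoring tokens, restored to original order
    keep = np.sort(np.argsort(scores)[-k:])
    return tokens[keep]

# Toy example: 8 tokens of dim 4, keep the 3 most informative
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
scores = np.array([0.05, 0.3, 0.02, 0.25, 0.01, 0.2, 0.1, 0.07])
reduced = select_informative_tokens(tokens, scores, k=3)
print(reduced.shape)  # (3, 4)
```

The paper also merges the discarded tokens into the retained set rather than dropping them outright; the sketch above only shows the selection step.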

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

In this work, we propose a training-free adaptive inference method for multimodal LLMs that can accommodate a broad range of efficiency requirements with minimal performance drop. Our method consists of a) iterative token merging based on embedding similarity before the LLM, and b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs.

https://arxiv.org/pdf/2412.03248
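The first component — merging tokens by embedding similarity before the LLM — can be illustrated with a greedy sketch. This is an assumption-laden toy, not AIM's actual procedure: it repeatedly finds the most cosine-similar pair of tokens and averages them until a target count is reached.

```python
import numpy as np

def merge_similar_tokens(tokens: np.ndarray, target: int) -> np.ndarray:
    """Greedily merge the most similar token pair until `target` tokens remain.

    tokens: (N, D) token embeddings; target: desired token count (<= N).
    """
    toks = tokens.copy()
    while len(toks) > target:
        # Pairwise cosine similarity
        norm = toks / np.linalg.norm(toks, axis=1, keepdims=True)
        sim = norm @ norm.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        # Replace the most similar pair with their mean
        merged = (toks[i] + toks[j]) / 2
        toks = np.delete(toks, [i, j], axis=0)
        toks = np.vstack([toks, merged[None]])
    return toks

# Two near-duplicate pairs collapse to two tokens
tokens = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.95]])
out = merge_similar_tokens(tokens, target=2)
print(out.shape)  # (2, 2)
```

The paper's second component, progressive pruning inside LLM layers, additionally weighs tokens by multi-modal importance at each layer; that step is not shown here.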

Blaizzy — Dec 07 '24