Improvements: Papers
VisionZip: A simple yet effective method that selects a small set of informative visual tokens as input to the language model, reducing visual-token redundancy and improving efficiency while maintaining model performance.
https://arxiv.org/pdf/2412.04467
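The idea of keeping only the most informative visual tokens can be sketched as a top-k selection over per-token importance scores. This is a minimal illustration, not VisionZip's actual algorithm: the scoring criterion here (a generic importance score, e.g. attention weight) is an assumption, and the real selection procedure is described in the paper.

```python
import numpy as np

def select_informative_tokens(tokens, scores, k):
    """Keep the k visual tokens with the highest importance scores.

    tokens: (N, D) array of visual token embeddings
    scores: (N,) per-token importance scores (e.g. attention-based --
            an assumption; see the paper for the actual criterion)
    k:      number of tokens to keep
    """
    keep = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    keep = np.sort(keep)                 # preserve original token order
    return tokens[keep]

# Example: 6 tokens of dimension 4, keep the 3 highest-scoring ones
rng = np.random.default_rng(0)
toks = rng.normal(size=(6, 4))
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.3])
kept = select_informative_tokens(toks, scores, k=3)
# kept contains the tokens at indices 1, 3, 5 in their original order
```

The key property is that the language model then processes k tokens instead of N, which shrinks the sequence length (and hence attention cost) without retraining.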
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
In this work, we propose a training-free adaptive inference method for multimodal LLMs that can accommodate a broad range of efficiency requirements with minimal performance drop. Our method consists of a) iterative token merging based on embedding similarity before the LLM, and b) progressive token pruning within LLM layers based on multimodal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs.
https://arxiv.org/pdf/2412.03248
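The first component of AIM, merging tokens by embedding similarity before the LLM, can be sketched as repeatedly averaging the most cosine-similar pair of tokens until a target count is reached. This greedy pairwise scheme is a simplified assumption for illustration; the paper's iterative merging procedure may differ in its pairing and aggregation details.

```python
import numpy as np

def merge_most_similar(tokens, target_n):
    """Iteratively merge the most cosine-similar token pair by averaging,
    until only target_n tokens remain (simplified sketch; see the paper
    for the actual merging scheme)."""
    tokens = tokens.astype(float).copy()
    while len(tokens) > target_n:
        # Cosine similarity between all token pairs
        normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)       # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (tokens[i] + tokens[j]) / 2  # average the closest pair
        tokens = np.vstack([np.delete(tokens, [i, j], axis=0),
                            merged[None, :]])
    return tokens

# Two near-duplicate clusters of 2D tokens collapse to two tokens
toks = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 0.9]])
out = merge_most_similar(toks, target_n=2)
```

Because merging happens before the LLM, it composes with the second component (importance-based pruning inside the LLM layers) to hit a chosen compute budget at inference time without any retraining.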