efficient-inference topic
Multistage_Pruning
Cheng-Hao Tu, Jia-Hong Lee, Yi-Ming Chan, and Chu-Song Chen, "Pruning Depthwise Separable Convolutions for MobileNet Compression," International Joint Conference on Neural Networks (IJCNN), July 2020.
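The repo implements a multistage pruning pipeline for depthwise separable convolutions; the exact staging is described in the paper. As a rough illustration of the core operation only, here is a minimal NumPy sketch of magnitude-based channel pruning on one depthwise separable block (all function and variable names are illustrative, not taken from the repo):

```python
import numpy as np

def prune_separable_block(dw_weight, pw_weight, keep_ratio=0.5):
    """Magnitude-prune the channels of a depthwise separable block.

    dw_weight: (C, k, k)   depthwise filters, one per input channel
    pw_weight: (C_out, C)  pointwise (1x1) filters mixing the C channels
    A channel's importance is the L1 norm of the pointwise column that
    consumes it; pruning removes the depthwise filter and the matching
    pointwise column together, so shapes stay consistent.
    """
    C = dw_weight.shape[0]
    keep = max(1, int(C * keep_ratio))
    importance = np.abs(pw_weight).sum(axis=0)      # per-channel L1 norm
    kept = np.sort(np.argsort(importance)[-keep:])  # indices of surviving channels
    return dw_weight[kept], pw_weight[:, kept], kept

# Toy example: prune an 8-channel block down to 4 channels.
rng = np.random.default_rng(0)
dw = rng.standard_normal((8, 3, 3))
pw = rng.standard_normal((16, 8))
dw_p, pw_p, kept = prune_separable_block(dw, pw, keep_ratio=0.5)
print(dw_p.shape, pw_p.shape, kept)  # (4, 3, 3) (16, 4) [...]
```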
KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
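KVQuant's full method combines per-channel, pre-RoPE key quantization with non-uniform codebooks and outlier handling; the sketch below shows only the simplest ingredient, uniform low-bit quantization of a cached tensor with per-channel scales (names and shapes are illustrative assumptions, not the repo's API):

```python
import numpy as np

def quantize_per_channel(x, bits=4, axis=1):
    """Uniform asymmetric quantization with per-channel scale/zero-point.

    Stores the KV cache as low-bit integer codes plus a float scale and
    zero-point per channel, trading some accuracy for large memory
    savings at long context lengths.
    """
    qmax = 2**bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Toy key cache: (heads, seq_len, head_dim); reducing over the token
# axis gives one scale per (head, channel) pair.
keys = np.random.default_rng(0).standard_normal((8, 128, 64)).astype(np.float32)
q, s, z = quantize_per_channel(keys, bits=4, axis=1)
err = np.abs(dequantize(q, s, z) - keys).mean()
print(f"mean abs error at 4 bits: {err:.4f}")
```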
fast_robust_early_exit
[EMNLP 2023] Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (long paper)
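The paper's contribution is a synchronized parallel decoding scheme that makes early exiting fast and robust; the toy below shows only the baseline idea it builds on, confidence-based early exiting at intermediate layers during decoding (the model, head, and threshold are all illustrative stand-ins):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_decode(layers, head, h, threshold=0.9):
    """Run the layers one by one; exit as soon as the LM head, applied
    to the intermediate hidden state, is confident enough.

    layers: list of callables h -> h (toy stand-ins for transformer blocks)
    head:   callable h -> logits over the vocabulary
    Returns (token, layers_used).
    """
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        probs = softmax(head(h))
        if probs.max() >= threshold:   # confident: skip the remaining layers
            return int(probs.argmax()), i
    return int(probs.argmax()), len(layers)

# Toy model: 6 residual "layers", a linear head, 32-token vocabulary.
rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64)) * 0.1
layers = [lambda h, A=rng.standard_normal((64, 64)) * 0.05: np.tanh(h @ A) + h
          for _ in range(6)]
token, used = early_exit_decode(layers, lambda h: W @ h,
                                rng.standard_normal(64), threshold=0.5)
print(f"emitted token {token} after {used}/6 layers")
```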
picollm
On-device LLM Inference Powered by X-Bit Quantization
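The description names X-bit quantization without further detail here, so the following is a generic sketch of what sub-byte weight quantization involves: group-wise n-bit codes packed densely into bytes plus per-group scales (all names are assumptions, not picoLLM's API):

```python
import numpy as np

def pack_nbit(q, bits):
    """Pack an array of n-bit integer codes into a dense uint8 buffer."""
    q = np.asarray(q, dtype=np.uint8).ravel()
    bitstream = np.unpackbits(q[:, None], axis=1, bitorder="little")[:, :bits]
    return np.packbits(bitstream.ravel(), bitorder="little")

def quantize_group(w, bits=4, group=32):
    """Group-wise symmetric quantization: each run of `group` weights
    shares one float scale; codes land in [-2^(b-1), 2^(b-1)-1] and are
    offset by 2^(b-1) so they pack as unsigned n-bit values."""
    w = w.reshape(-1, group)
    qmax = 2**(bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / qmax
    codes = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return pack_nbit(codes + 2**(bits - 1), bits), scale

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
packed, scales = quantize_group(w, bits=4, group=32)
print(f"{w.nbytes} B fp32 -> {packed.nbytes} B packed + {scales.nbytes} B scales")
```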
AsyncDiff
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
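AsyncDiff splits the denoising model into sequential components placed on different devices, letting each component consume its predecessor's output from the previous denoising step so that all components of a step can run concurrently. The toy below only emulates that stale-input dataflow in NumPy to show the approximation it introduces; it is a sketch of the idea, not the repo's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "denoiser" split into 3 sequential stages (stand-ins for chunks
# of a diffusion network assigned to different devices).
stages = [lambda x, A=rng.standard_normal((16, 16)) * 0.1: np.tanh(A @ x)
          for _ in range(3)]

def denoise_sync(x, steps):
    """Baseline: at every step the stages run strictly in sequence."""
    for _ in range(steps):
        for f in stages:
            x = f(x)
    return x

def denoise_async(x, steps):
    """Asynchronous variant: stage k at step t consumes stage k-1's
    output from step t-1. The stale inputs break the dependency chain,
    so in a real deployment all stages of a step could run in parallel;
    here we just emulate the dataflow sequentially."""
    buf = [x.copy()] * (len(stages) + 1)   # buf[0] = latent, buf[k] = stage k output
    for _ in range(steps):
        buf = [buf[-1]] + [f(buf[k]) for k, f in enumerate(stages)]
    return buf[-1]

x0 = rng.standard_normal(16)
drift = np.abs(denoise_sync(x0, 8) - denoise_async(x0, 8)).mean()
print(f"mean deviation from sequential denoising: {drift:.4f}")
```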
AutoVP
[ICLR 2024] AutoVP: An Automated Visual Prompting Framework and Benchmark
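AutoVP automates the design choices of visual prompting (prompt design, source model selection, output-label mapping); the sketch below shows only the basic input transform visual prompting builds on, a trainable pixel frame around the image (shapes and names are illustrative, not the repo's API):

```python
import numpy as np

def apply_visual_prompt(image, prompt, pad):
    """Frame-style visual prompt: a trainable border of width `pad` is
    added around the input image; only the border pixels come from
    `prompt`, the interior keeps the original image.

    image:  (C, H, W) input
    prompt: (C, H + 2*pad, W + 2*pad) trainable parameter tensor
    """
    C, H, W = image.shape
    out = prompt.copy()
    out[:, pad:pad + H, pad:pad + W] = image  # interior = original pixels
    return out

# Toy usage: 3x192x192 image + 16-pixel learnable frame -> 3x224x224 input.
rng = np.random.default_rng(0)
img = rng.random((3, 192, 192)).astype(np.float32)
prompt = np.zeros((3, 224, 224), dtype=np.float32)  # trained by backprop in practice
prompted = apply_visual_prompt(img, prompt, pad=16)
print(prompted.shape)  # (3, 224, 224)
```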