
PowerInfer: using a combination of CPU and GPU for faster inference

Open nivibilla opened this issue 8 months ago • 3 comments

Splitting hot and cold neurons across the CPU and GPU allows faster inference when using larger models or higher quantisations. The demo shows an 11x speedup over llama.cpp when running a 40B model on a single 24GB GPU.
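
For context, the core idea is roughly: profile which FFN neurons fire most often, keep those "hot" rows on the GPU, leave the "cold" rest on the CPU, and sum the two partial outputs. A toy sketch of that split (random weights, made-up sizes and a made-up 20% split, not PowerInfer's or vLLM's actual code):

```python
import torch

# Toy sketch of the hot/cold split: the most frequently activated ("hot")
# rows of an FFN live on the GPU, the rest ("cold") stay on the CPU, and
# the two partial outputs are summed. Sizes, the 20% split, and the random
# weights are all made up for illustration.

hidden, inter = 1024, 4096
torch.manual_seed(0)
w_up = torch.randn(inter, hidden)      # FFN up-projection
w_down = torch.randn(hidden, inter)    # FFN down-projection

# Pretend an offline profiling pass marked the 20% most active neurons as hot.
hot_idx = torch.randperm(inter)[: int(inter * 0.2)]
cold_mask = torch.ones(inter, dtype=torch.bool)
cold_mask[hot_idx] = False
cold_idx = cold_mask.nonzero(as_tuple=True)[0]

device = "cuda" if torch.cuda.is_available() else "cpu"
w_up_hot, w_down_hot = w_up[hot_idx].to(device), w_down[:, hot_idx].to(device)
w_up_cold, w_down_cold = w_up[cold_idx], w_down[:, cold_idx]   # stays on CPU

def ffn_split(x_cpu: torch.Tensor) -> torch.Tensor:
    """ReLU FFN with hot neurons computed on GPU and cold neurons on CPU."""
    x_gpu = x_cpu.to(device)
    hot_out = torch.relu(x_gpu @ w_up_hot.T) @ w_down_hot.T     # GPU partial sum
    cold_out = torch.relu(x_cpu @ w_up_cold.T) @ w_down_cold.T  # CPU partial sum
    return hot_out.cpu() + cold_out

x = torch.randn(1, hidden)
print(ffn_split(x).shape)   # torch.Size([1, 1024])
```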

Demo https://twitter.com/omarsar0/status/1737168751668187229?t=blU8xZMb7JMJTtAHra7zvQ&s=19

GitHub https://github.com/SJTU-IPADS/PowerInfer

Wondering if this is something that can also be integrated into vllm.

nivibilla avatar Dec 20 '23 04:12 nivibilla

It is designed mainly to improve the speed of sparse LLMs. It won't allow faster inference with dense LLMs.

i-amgeek avatar Dec 22 '23 05:12 i-amgeek

They show sparsity even in dense models like Falcon. But I guess Mixtral MoE is a better candidate.
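
A rough way to check that empirically on a dense ReLU model is to run a calibration batch through an FFN up-projection and count how often each intermediate neuron fires (random placeholder weights and inputs here, not real Falcon weights):

```python
import torch

# Count how often each intermediate neuron is nonzero over a calibration
# batch. Weights and inputs are random placeholders; a trained ReLU model
# would show a much more skewed distribution, which is exactly what the
# hot/cold split exploits.

hidden, inter = 1024, 4096
torch.manual_seed(0)
w_up = torch.randn(inter, hidden) / hidden ** 0.5

calib = torch.randn(512, hidden)             # stand-in calibration inputs
acts = torch.relu(calib @ w_up.T)            # (batch, inter)
fire_rate = (acts > 0).float().mean(dim=0)   # per-neuron activation frequency

hot = (fire_rate > 0.5).sum().item()
print(f"neurons firing on >50% of inputs: {hot}/{inter}")
```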

nivibilla avatar Dec 22 '23 06:12 nivibilla

It is designed mainly to improve the speed of sparse LLMs. It won't allow faster inference with dense LLMs.

But there are still many LLMs using the ReLU activation function, so could this still have a chance?
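
The activation function does seem to be the key factor. A quick illustration of why: ReLU zeroes negative pre-activations exactly, so those neurons can be skipped, while SiLU (used by Llama-family models) is almost never exactly zero, so there is nothing to skip (random tensors stand in for real pre-activations):

```python
import torch
import torch.nn.functional as F

# Compare how many activations come out exactly zero under ReLU vs SiLU.
# Random tensors stand in for real FFN pre-activations.

torch.manual_seed(0)
pre_act = torch.randn(4, 11008)

relu_zeros = (torch.relu(pre_act) == 0).float().mean()
silu_zeros = (F.silu(pre_act) == 0).float().mean()

print(f"ReLU zero fraction: {relu_zeros:.2%}")   # roughly 50% even on random inputs
print(f"SiLU zero fraction: {silu_zeros:.2%}")   # essentially 0%
```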

libratiger avatar Dec 26 '23 08:12 libratiger