
PowerInfer: using a combination of CPU and GPU for faster inference

Open nivibilla opened this issue 8 months ago • 3 comments

Splitting hot and cold neurons across the CPU and GPU allows faster inference when using larger models or higher quantisations. The demo shows an 11x speedup over llama.cpp when running a 40B model on a single 24GB GPU.
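
For context, the core idea is roughly: profile which FFN neurons fire most often, keep those "hot" rows on the GPU, leave the "cold" rest on the CPU, and sum the two partial outputs. A toy sketch of that split (random weights, made-up sizes and a made-up 20% split, not PowerInfer's or vLLM's actual code):

```python
import torch

# Toy sketch of the hot/cold split: the most frequently activated ("hot")
# rows of an FFN live on the GPU, the rest ("cold") stay on the CPU, and
# the two partial outputs are summed. Sizes, the 20% split, and the random
# weights are all made up for illustration.

hidden, inter = 1024, 4096
torch.manual_seed(0)
w_up = torch.randn(inter, hidden)      # FFN up-projection
w_down = torch.randn(hidden, inter)    # FFN down-projection

# Pretend an offline profiling pass marked the 20% most active neurons as hot.
hot_idx = torch.randperm(inter)[: int(inter * 0.2)]
cold_mask = torch.ones(inter, dtype=torch.bool)
cold_mask[hot_idx] = False
cold_idx = cold_mask.nonzero(as_tuple=True)[0]

device = "cuda" if torch.cuda.is_available() else "cpu"
w_up_hot, w_down_hot = w_up[hot_idx].to(device), w_down[:, hot_idx].to(device)
w_up_cold, w_down_cold = w_up[cold_idx], w_down[:, cold_idx]   # stays on CPU

def ffn_split(x_cpu: torch.Tensor) -> torch.Tensor:
    """ReLU FFN with hot neurons computed on GPU and cold neurons on CPU."""
    x_gpu = x_cpu.to(device)
    hot_out = torch.relu(x_gpu @ w_up_hot.T) @ w_down_hot.T     # GPU partial sum
    cold_out = torch.relu(x_cpu @ w_up_cold.T) @ w_down_cold.T  # CPU partial sum
    return hot_out.cpu() + cold_out

x = torch.randn(1, hidden)
print(ffn_split(x).shape)   # torch.Size([1, 1024])
```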

Demo https://twitter.com/omarsar0/status/1737168751668187229?t=blU8xZMb7JMJTtAHra7zvQ&s=19

GitHub https://github.com/SJTU-IPADS/PowerInfer

Wondering if this is something that can also be integrated into vllm.

nivibilla avatar Dec 20 '23 04:12 nivibilla

It is designed mainly to improve the speed of sparse LLMs. It won't allow faster inference with dense LLMs.

i-amgeek avatar Dec 22 '23 05:12 i-amgeek

They show sparsity even in dense models like Falcon. But I guess Mixtral MoE is a better candidate.
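
A rough way to check that empirically on a dense ReLU model is to run a calibration batch through an FFN up-projection and count how often each intermediate neuron fires (random placeholder weights and inputs here, not real Falcon weights):

```python
import torch

# Count how often each intermediate neuron is nonzero over a calibration
# batch. Weights and inputs are random placeholders; a trained ReLU model
# would show a much more skewed distribution, which is exactly what the
# hot/cold split exploits.

hidden, inter = 1024, 4096
torch.manual_seed(0)
w_up = torch.randn(inter, hidden) / hidden ** 0.5

calib = torch.randn(512, hidden)             # stand-in calibration inputs
acts = torch.relu(calib @ w_up.T)            # (batch, inter)
fire_rate = (acts > 0).float().mean(dim=0)   # per-neuron activation frequency

hot = (fire_rate > 0.5).sum().item()
print(f"neurons firing on >50% of inputs: {hot}/{inter}")
```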

nivibilla avatar Dec 22 '23 06:12 nivibilla

It is designed mainly to improve the speed of sparse LLMs. It won't allow faster inference with dense LLMs.

But there are still many LLMs using the ReLU activation function, so could this still have a chance?
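
The activation function does seem to be the key factor. A quick illustration of why: ReLU zeroes negative pre-activations exactly, so those neurons can be skipped, while SiLU (used by Llama-family models) is almost never exactly zero, so there is nothing to skip (random tensors stand in for real pre-activations):

```python
import torch
import torch.nn.functional as F

# Compare how many activations come out exactly zero under ReLU vs SiLU.
# Random tensors stand in for real FFN pre-activations.

torch.manual_seed(0)
pre_act = torch.randn(4, 11008)

relu_zeros = (torch.relu(pre_act) == 0).float().mean()
silu_zeros = (F.silu(pre_act) == 0).float().mean()

print(f"ReLU zero fraction: {relu_zeros:.2%}")   # roughly 50% even on random inputs
print(f"SiLU zero fraction: {silu_zeros:.2%}")   # essentially 0%
```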

libratiger avatar Dec 26 '23 08:12 libratiger