
[Usage]: How to offload some layers to CPU?

cheney369 opened this issue 10 months ago • 5 comments

Your current environment

None

How would you like to use vllm

I want to load qwen2-14B-chat with vLLM, but I only have one RTX 4090 (24 GB). Can vLLM offload some layers to the CPU and keep the others on the GPU? As far as I know, transformers-accelerate and llama.cpp can do this, but I want to use vLLM's multi-LoRA switching feature.

cheney369 avatar Apr 09 '24 09:04 cheney369
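
For reference: later vLLM releases added a `cpu_offload_gb` engine argument that treats a slice of CPU RAM as virtual extra GPU memory, and it can be combined with multi-LoRA serving. Below is a minimal sketch assuming a vLLM version that supports this argument; the model id, offload size, and adapter path are illustrative, not taken from this thread:

```python
# Sketch: partially offload weights to CPU RAM while keeping multi-LoRA
# serving, assuming a vLLM release that supports cpu_offload_gb.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen1.5-14B-Chat",  # illustrative model id
    cpu_offload_gb=8,               # use ~8 GiB of CPU RAM as extra weight storage
    gpu_memory_utilization=0.90,
    enable_lora=True,               # keep the multi-LoRA switching feature
    max_loras=2,
)

# Each request can select a different LoRA adapter.
outputs = llm.generate(
    ["Hello, who are you?"],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora"),  # hypothetical adapter
)
print(outputs[0].outputs[0].text)
```

Offloaded layers are streamed back to the GPU on every forward pass, so this trades throughput for capacity, which is consistent with the performance concern raised later in this thread.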

https://github.com/vllm-project/vllm/issues/3563

eigen2017 avatar Apr 23 '24 12:04 eigen2017

https://github.com/vllm-project/vllm/issues/627

eigen2017 avatar Apr 24 '24 02:04 eigen2017

https://github.com/bd-iaas-us/vllm/pull/1

eigen2017 avatar Apr 24 '24 06:04 eigen2017

https://github.com/bd-iaas-us/vllm/issues/3

eigen2017 avatar Apr 24 '24 06:04 eigen2017

It's not a good idea to use CPU memory, since vLLM is built to accelerate inference. There is a trade-off option: we can cut some weights to fit limited HBM; for example, MoE models can drop some experts. See this: https://github.com/huggingface/transformers/pull/30552

eigen2017 avatar May 06 '24 12:05 eigen2017
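
One way to read the "drop some experts" suggestion is to filter expert weights out of a MoE checkpoint before loading. The sketch below assumes a PyTorch state dict whose expert weights live under keys like `...experts.<idx>...`; the key pattern, file names, and kept-expert set are all hypothetical, and a real pruning pass would also need to remap the router outputs to the surviving experts:

```python
# Rough illustration of cutting MoE experts from a checkpoint to fit
# smaller HBM. Key layout is hypothetical and varies between models.
import torch

KEEP_EXPERTS = {0, 1, 2, 3}  # assumption: keep only the first four experts

def prune_experts(state_dict):
    """Drop weights of experts whose index is not in KEEP_EXPERTS."""
    pruned = {}
    for name, tensor in state_dict.items():
        parts = name.split(".")
        if "experts" in parts:
            pos = parts.index("experts")
            # Expect the expert index right after "experts" in the key.
            if pos + 1 < len(parts) and parts[pos + 1].isdigit():
                if int(parts[pos + 1]) not in KEEP_EXPERTS:
                    continue  # drop this expert's weights
        pruned[name] = tensor
    return pruned

# Hypothetical checkpoint paths, for illustration only.
state_dict = torch.load("moe_checkpoint.pt", map_location="cpu")
torch.save(prune_experts(state_dict), "moe_checkpoint_pruned.pt")
```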