[Usage]: How to offload some layers to CPU?
Your current environment
None
How would you like to use vllm
I want to load Qwen2-14B-Chat using vLLM, but I only have one RTX 4090 (24 GB). Can vLLM offload some layers to the CPU and keep the others on the GPU? As far as I know, transformers with accelerate and llama.cpp can do this, but I want to use the multi-LoRA switching feature in vLLM.
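For reference, a minimal sketch of the kind of layer offloading the question refers to, done with transformers + accelerate rather than vLLM; the model id, memory limits, and prompt are assumptions for illustration:

```python
# Sketch: split a model between GPU and CPU with transformers + accelerate
# (requires `accelerate` installed). This is what the question means by
# "offload some layers to CPU" -- it is not a vLLM feature here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-14B-Chat"  # assumed 14B chat checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",               # place layers on GPU first,
    max_memory={0: "22GiB",          # then spill the rest to CPU RAM
                "cpu": "48GiB"},
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```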
https://github.com/vllm-project/vllm/issues/3563
https://github.com/vllm-project/vllm/issues/627
https://github.com/bd-iaas-us/vllm/pull/1
https://github.com/bd-iaas-us/vllm/issues/3
It's not a good idea to use CPU memory, since vLLM is built for inference acceleration. There is a trade-off: if the weights don't fit in limited HBM, we can cut some of them, e.g. MoE models can drop some experts. See this: https://github.com/huggingface/transformers/pull/30552
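As a rough illustration of the "drop some experts" idea (not necessarily the approach in the linked PR), here is a hedged sketch that keeps only the first few experts of a Mixtral-style MoE checkpoint. The model id, the number of kept experts, and the parameter naming scheme are assumptions based on the Mixtral implementation in transformers; it needs a lot of host RAM since both models are materialized, and quality will drop because the router was trained with all experts present.

```python
# Sketch: prune a Mixtral-style MoE checkpoint down to its first KEEP experts
# so the remaining weights fit a smaller GPU. Assumed parameter names:
# "...block_sparse_moe.experts.<idx>..." and "...block_sparse_moe.gate.weight".
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumed MoE checkpoint
KEEP = 4                                            # keep 4 of the 8 experts

# Full model (source of weights) and a smaller skeleton with fewer experts.
full = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
config = AutoConfig.from_pretrained(model_id)
config.num_local_experts = KEEP
small = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

pruned = {}
for name, tensor in full.state_dict().items():
    if ".experts." in name:
        expert_idx = int(name.split(".experts.")[1].split(".")[0])
        if expert_idx >= KEEP:
            continue                       # drop weights of removed experts
    elif name.endswith("block_sparse_moe.gate.weight"):
        tensor = tensor[:KEEP]             # router now scores only kept experts
    pruned[name] = tensor

small.load_state_dict(pruned)
small.save_pretrained("mixtral-8x7b-4experts")  # reload later for serving
```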