[Usage]: How to offload some layers to CPU?
Your current environment
None
How would you like to use vllm
I want to load Qwen2-14B-Chat using vLLM, but I only have one RTX 4090 (24 GB). Can vLLM offload some layers to the CPU and keep the others on the GPU? As far as I know, transformers with accelerate and llama.cpp can do this, but I want to use the multi-LoRA switching feature in vLLM.
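For reference, a minimal sketch of the kind of layer offloading the question refers to, done with transformers + accelerate rather than vLLM; the model id, memory limits, and prompt are assumptions for illustration:

```python
# Sketch: split a model between GPU and CPU with transformers + accelerate
# (requires `accelerate` installed). This is what the question means by
# "offload some layers to CPU" -- it is not a vLLM feature here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-14B-Chat"  # assumed 14B chat checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",               # place layers on GPU first,
    max_memory={0: "22GiB",          # then spill the rest to CPU RAM
                "cpu": "48GiB"},
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```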
https://github.com/vllm-project/vllm/issues/3563
https://github.com/vllm-project/vllm/issues/627
https://github.com/bd-iaas-us/vllm/pull/1
https://github.com/bd-iaas-us/vllm/issues/3
It's not a good idea to use CPU memory, since vLLM is built for inference acceleration. There is a trade-off: if the weights don't fit in limited HBM, we can cut some of them, e.g. MoE models can drop some experts. See this: https://github.com/huggingface/transformers/pull/30552
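As a rough illustration of the "drop some experts" idea (not necessarily the approach in the linked PR), here is a hedged sketch that keeps only the first few experts of a Mixtral-style MoE checkpoint. The model id, the number of kept experts, and the parameter naming scheme are assumptions based on the Mixtral implementation in transformers; it needs a lot of host RAM since both models are materialized, and quality will drop because the router was trained with all experts present.

```python
# Sketch: prune a Mixtral-style MoE checkpoint down to its first KEEP experts
# so the remaining weights fit a smaller GPU. Assumed parameter names:
# "...block_sparse_moe.experts.<idx>..." and "...block_sparse_moe.gate.weight".
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumed MoE checkpoint
KEEP = 4                                            # keep 4 of the 8 experts

# Full model (source of weights) and a smaller skeleton with fewer experts.
full = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
config = AutoConfig.from_pretrained(model_id)
config.num_local_experts = KEEP
small = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

pruned = {}
for name, tensor in full.state_dict().items():
    if ".experts." in name:
        expert_idx = int(name.split(".experts.")[1].split(".")[0])
        if expert_idx >= KEEP:
            continue                       # drop weights of removed experts
    elif name.endswith("block_sparse_moe.gate.weight"):
        tensor = tensor[:KEEP]             # router now scores only kept experts
    pruned[name] = tensor

small.load_state_dict(pruned)
small.save_pretrained("mixtral-8x7b-4experts")  # reload later for serving
```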