[Feature] Prototype of vLLM execution on CPU-only devices
Hi, vLLM genius @WoosukKwon @zhuohan123. Motivated by some requests to execute vLLM on the CPU (e.g., #176), we recently implemented an initial prototype for CPU-only execution on the x86 CPU platform.
What we have right now:
- Minimized changes to vLLM core components to support CPU execution, including:
  - Introduced a new configuration argument `device` (`'cuda'` or `'cpu'`, `'cuda'` by default) to specify the main device on which vLLM executes.
  - Replaced the hard-coded device assignments (e.g., `.cuda()`) with `.to(device=device)`, or with the `set_default_device` context, to support vLLM execution on different device types (see the first sketch after this list).
  - Modified `CacheEngine` to allocate blocks from the CPU cache tensor (originally used for swapping) under CPU-only mode. The size of the CPU cache can be specified with `--swap-space`.
- Support for the FP32 and BF16 data types.
- Native operators implemented for x86 CPUs using the `AVX512_BF16` instruction set.
- An operator dispatcher based on the device type of the input tensors (see the second sketch after this list).
- Integration with the existing build script:
  - Building the CPU operators is controlled by an environment variable, `VLLM_BUILD_CPU_OPS`, which is disabled by default.
  - Due to CUDA compatibility constraints, the CPU operators can only be built with `gcc-12` and `g++-12` to support the `AVX512_BF16` instruction set.
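As a rough illustration of the device-generalization pattern above, here is a minimal sketch (with hypothetical names such as `DeviceConfig`, `load_weights`, and `build_model`; not the actual vLLM code) of how hard-coded `.cuda()` placement becomes device-agnostic:

```python
import torch


class DeviceConfig:
    """Hypothetical holder for the new `device` argument ('cuda' by default)."""

    def __init__(self, device: str = "cuda"):
        self.device = torch.device(device)


def load_weights(state_dict: dict, device_config: DeviceConfig) -> dict:
    # Before: tensor.cuda()  (hard-coded CUDA placement)
    # After:  tensor.to(device=...), so the same path also works on the CPU.
    return {name: t.to(device=device_config.device) for name, t in state_dict.items()}


def build_model(model_cls, model_config, device_config: DeviceConfig):
    # Setting the default device makes tensors created inside the model
    # constructor land on the target device without touching the model code.
    torch.set_default_device(device_config.device)
    return model_cls(model_config)
```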
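And a minimal sketch of the operator-dispatch idea, again with placeholder names; the real prototype would register the compiled CUDA extension and the new `AVX512_BF16` CPU extension instead of the pure-PyTorch stand-in used here to keep the example runnable:

```python
import torch

# Registry mapping a device type ('cpu', 'cuda', ...) to an implementation.
_SILU_AND_MUL_IMPLS = {}


def register_silu_and_mul(device_type: str, fn) -> None:
    _SILU_AND_MUL_IMPLS[device_type] = fn


def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Dispatch based on the device type of the input tensor.
    return _SILU_AND_MUL_IMPLS[x.device.type](x)


def _reference_silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Pure-PyTorch stand-in for the real kernels: SiLU(first half) * second half.
    d = x.shape[-1] // 2
    return torch.nn.functional.silu(x[..., :d]) * x[..., d:]


# In the prototype, these entries would point at the compiled extension kernels.
register_silu_and_mul("cpu", _reference_silu_and_mul)
register_silu_and_mul("cuda", _reference_silu_and_mul)
```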
Install Instructions
- Make sure the default version of `gcc`/`g++` is 12.
- Install PyTorch with `pip install torch==2.1.2+cpu --index-url https://download.pytorch.org/whl/cpu`.
- Build the source with `VLLM_BUILD_CPU_ONLY=1 MAX_JOBS=8 pip install --no-build-isolation -v -e .` (a quick smoke-test sketch follows below).
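After building, a quick smoke test could look like the following. This is only a sketch: it assumes the new `device` argument is also exposed through the `LLM` entry point, and the model path is a placeholder for any of the verified checkpoints.

```python
from vllm import LLM, SamplingParams

# Hypothetical CPU smoke test; `device="cpu"` is the new argument added by this
# prototype, and exposing it via the LLM constructor is an assumption.
llm = LLM(
    model="/root/vicuna-7b-v1.5/",  # placeholder path to a verified model
    dtype="bfloat16",               # FP32 and BF16 are the supported dtypes
    swap_space=40,                  # sized as the CPU KV cache in CPU-only mode
    device="cpu",
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```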
Known Limits:
- Tensor parallelism is not supported right now.
- FP16 is not fully supported due to instruction set limitations.
- Quantization is not supported right now.
- ~~Sliding window attention is not verified right now.~~
Model Support:
- We have only verified models based on `LlamaForCausalLM`, `MistralForCausalLM`, and `OPTForCausalLM` so far.
- Ideally, this implementation can support all implemented models without modifying the model definitions.
Performance
We used the following command to evaluate the performance with vicuna-7b-v1.5 on an Intel(R) Xeon(R) CPU Max 9462 platform with 32 physical cores:

```
OMP_NUM_THREADS=32 numactl --physcpubind=0-31 --membind=0 python benchmark_throughput.py --backend=vllm --dataset=/root/ShareGPT_V3_unfiltered_cleaned_split.json --model=/root/vicuna-7b-v1.5/ --n=1 --num-prompts=1000 --dtype=bfloat16 --trust-remote-code --device=cpu --swap-space=40
```
The implementation achieved good throughput on the CPU platform:
~~Throughput: 0.76 requests/s, 358.22 tokens/s~~
Throughput: 1.00 requests/s, 479.15 tokens/s
There is still much room for performance improvement. We will continue to optimize performance and add the remaining features, hoping this will be helpful for users who want to deploy vLLM on the CPU.
Please help review the code; any feedback is welcome, thanks!