[Feature] Prototype of vLLM execution on CPU-only devices
Hi, vLLM genius @WoosukKwon @zhuohan123. Motivated by some requests to execute vLLM on the CPU (e.g., #176), we recently implemented an initial prototype for CPU-only execution on the x86 CPU platform.
What we have right now:
- Minimized changes to vLLM core components to support CPU execution, including:
  - Introduced a new configuration argument `device` (`'cuda'` or `'cpu'`, `'cuda'` by default) to specify the main device on which vLLM executes.
  - Replaced the hard-coded device assignments (e.g., `.cuda()`) with `.to(device=device)`, or with the `set_default_device` context, to support vLLM execution on different device types (see the first sketch after this list).
  - Modified `CacheEngine` to allocate blocks from the CPU cache tensor (originally used for swapping) under CPU-only mode. The size of the CPU cache can be specified with `--swap-space`.
- Support for the FP32 and BF16 data types.
- Native operators implemented for x86 CPUs using the `AVX512_BF16` instruction set.
- An operator dispatcher based on the device type of the input tensors (see the second sketch after this list).
- Integration with the existing build script:
  - Building the CPU operators is controlled by an environment variable, `VLLM_BUILD_CPU_OPS`, which is disabled by default.
  - Due to CUDA compatibility constraints, the CPU operators can only be built with `gcc-12` and `g++-12` to support the `AVX512_BF16` instruction set.
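As a rough illustration of the device-generalization pattern above, here is a minimal sketch (with hypothetical names such as `DeviceConfig`, `load_weights`, and `build_model`; not the actual vLLM code) of how hard-coded `.cuda()` placement becomes device-agnostic:

```python
import torch


class DeviceConfig:
    """Hypothetical holder for the new `device` argument ('cuda' by default)."""

    def __init__(self, device: str = "cuda"):
        self.device = torch.device(device)


def load_weights(state_dict: dict, device_config: DeviceConfig) -> dict:
    # Before: tensor.cuda()  (hard-coded CUDA placement)
    # After:  tensor.to(device=...), so the same path also works on the CPU.
    return {name: t.to(device=device_config.device) for name, t in state_dict.items()}


def build_model(model_cls, model_config, device_config: DeviceConfig):
    # Setting the default device makes tensors created inside the model
    # constructor land on the target device without touching the model code.
    torch.set_default_device(device_config.device)
    return model_cls(model_config)
```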
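And a minimal sketch of the operator-dispatch idea, again with placeholder names; the real prototype would register the compiled CUDA extension and the new `AVX512_BF16` CPU extension instead of the pure-PyTorch stand-in used here to keep the example runnable:

```python
import torch

# Registry mapping a device type ('cpu', 'cuda', ...) to an implementation.
_SILU_AND_MUL_IMPLS = {}


def register_silu_and_mul(device_type: str, fn) -> None:
    _SILU_AND_MUL_IMPLS[device_type] = fn


def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Dispatch based on the device type of the input tensor.
    return _SILU_AND_MUL_IMPLS[x.device.type](x)


def _reference_silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Pure-PyTorch stand-in for the real kernels: SiLU(first half) * second half.
    d = x.shape[-1] // 2
    return torch.nn.functional.silu(x[..., :d]) * x[..., d:]


# In the prototype, these entries would point at the compiled extension kernels.
register_silu_and_mul("cpu", _reference_silu_and_mul)
register_silu_and_mul("cuda", _reference_silu_and_mul)
```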
Install Instructions
- Make sure the default version of `gcc`/`g++` is 12.
- Install PyTorch with `pip install torch==2.1.2+cpu --index-url https://download.pytorch.org/whl/cpu`.
- Build the source with `VLLM_BUILD_CPU_ONLY=1 MAX_JOBS=8 pip install --no-build-isolation -v -e .` (a quick smoke-test sketch follows below).
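After building, a quick smoke test could look like the following. This is only a sketch: it assumes the new `device` argument is also exposed through the `LLM` entry point, and the model path is a placeholder for any of the verified checkpoints.

```python
from vllm import LLM, SamplingParams

# Hypothetical CPU smoke test; `device="cpu"` is the new argument added by this
# prototype, and exposing it via the LLM constructor is an assumption.
llm = LLM(
    model="/root/vicuna-7b-v1.5/",  # placeholder path to a verified model
    dtype="bfloat16",               # FP32 and BF16 are the supported dtypes
    swap_space=40,                  # sized as the CPU KV cache in CPU-only mode
    device="cpu",
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```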
Known Limits:
- Tensor parallelism is not supported right now.
- FP16 is not fully supported due to instruction set limitations.
- Quantization is not supported right now.
- ~~Sliding window attention is not verified right now.~~
Model Support:
- We have only verified models based on `LlamaForCausalLM`, `MistralForCausalLM`, and `OPTForCausalLM` so far.
- Ideally, this implementation can support all implemented models without modifying the model definitions.
Performance
We used the following command to evaluate the performance with vicuna-7b-v1.5 on an Intel(R) Xeon(R) CPU Max 9462 platform with 32 physical cores:

```
OMP_NUM_THREADS=32 numactl --physcpubind=0-31 --membind=0 python benchmark_throughput.py --backend=vllm --dataset=/root/ShareGPT_V3_unfiltered_cleaned_split.json --model=/root/vicuna-7b-v1.5/ --n=1 --num-prompts=1000 --dtype=bfloat16 --trust-remote-code --device=cpu --swap-space=40
```
The implementation achieved good throughput on the CPU platform:
~~Throughput: 0.76 requests/s, 358.22 tokens/s~~
Throughput: 1.00 requests/s, 479.15 tokens/s
There is still much room for performance improvement. We will continue to optimize performance and add the remaining features, hoping this will be helpful for users who want to deploy vLLM on the CPU.
Please help review the code; any feedback is welcome, thanks!