[Core][Kernel][Misc] Support external swapper for vllm
Hi,
In the existing version of vLLM, when GPU memory is insufficient, the KV cache has to be swapped to CPU memory. Building on that CPU swapping path, we abstracted an external swapper interface and implemented a local-file backend for storing the KV cache; other distributed storage backends may be added in the future. By adding the external swapper, the storage space available for tokens is greatly expanded.
The specific design is as follows. The KV cache is stored in a hierarchy:
GPU -> CPU -> External Swapper.
Each level offers more capacity but higher latency. Newly generated KV cache is placed in the lowest-latency tier first; when that tier runs out of space, blocks are swapped out to the next tier. We keep the existing vLLM scheduling and executor strategies unchanged, and simply abstract an external swapper interface under the cache engine so that different swapper backends can be plugged in.
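To make the design concrete, here is a minimal sketch of what such an external swapper interface could look like. The class and method names below are illustrative assumptions, not the exact names used in this PR:

```python
from abc import ABC, abstractmethod
from typing import Dict, List

import torch


class ExternalSwapper(ABC):
    """Illustrative interface only: a backend the cache engine can move
    KV-cache blocks to once GPU and CPU space are exhausted."""

    @abstractmethod
    def swap_out(self, kv_cache: torch.Tensor,
                 block_mapping: Dict[int, int]) -> None:
        """Copy the given blocks out of device/host memory into the
        external backend (local file, remote store, ...)."""

    @abstractmethod
    def swap_in(self, kv_cache: torch.Tensor,
                block_mapping: Dict[int, int]) -> None:
        """Copy previously swapped-out blocks back toward the GPU."""

    @abstractmethod
    def free(self, block_ids: List[int]) -> None:
        """Release external space held by blocks that are no longer needed."""
```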
Currently, the local-file external swapper is implemented; an external swapper for Valkey (an RDMA version of Redis) will be implemented next.
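As a rough illustration of the local-file idea, each swapped-out block could simply be serialized to its own file under the configured directory. This is a sketch following the hypothetical interface above, not the code in this PR; the class name and file layout are assumptions:

```python
import os

import torch


class FileSwapper:
    """Hypothetical local-file backend implementing the ExternalSwapper
    interface sketched above: one file per swapped-out KV-cache block."""

    def __init__(self, root: str) -> None:
        # e.g. root parsed from --external-swapper file:///root/test
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, block_id: int) -> str:
        return os.path.join(self.root, f"block_{block_id}.pt")

    def swap_out(self, kv_cache: torch.Tensor, block_mapping: dict) -> None:
        # block_mapping: source GPU block id -> external block id.
        for src, dst in block_mapping.items():
            # Stage the block through CPU memory, then persist it to disk.
            torch.save(kv_cache[src].cpu(), self._path(dst))

    def swap_in(self, kv_cache: torch.Tensor, block_mapping: dict) -> None:
        # block_mapping: external block id -> destination GPU block id.
        for src, dst in block_mapping.items():
            kv_cache[dst].copy_(torch.load(self._path(src)))

    def free(self, block_ids) -> None:
        # Drop files for blocks whose sequences have finished.
        for block_id in block_ids:
            try:
                os.remove(self._path(block_id))
            except FileNotFoundError:
                pass
```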
For the local-file implementation, we also ran some benchmarks. Our test environment is 4x NVIDIA A10 GPUs with an NVMe disk.
1. Kernel benchmark (benchmarks/kernels/benchmark_swap_blocks.py): the execution time of a single kernel increased by about 60% (a measurement sketch follows the benchmark results below).
Avg. GPU->CPU time taken for swapping blocks: 0.016023850440979003 seconds
Avg. GPU->File time taken for swapping blocks: 0.026778271198272707 seconds
Avg. File->GPU time taken for swapping blocks: 0.025919318199157715 seconds
2. Online server benchmark: token throughput dropped only slightly, by about 2%.
2.1 Swap to CPU (997 swaps):
Server:
python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode swap --tensor-parallel-size 4 --swap-space 40
Client:
python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000
2.2 Swap to File (996 swaps):
Server:
python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode swap --swap-space 0 --external-swapper file:///root/test --external-swapper-space 40 --tensor-parallel-size 4
Client:
python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000
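For reference, here is a minimal, self-contained sketch of how GPU->CPU and GPU->File swap times of this kind can be measured. It is not the actual benchmarks/kernels/benchmark_swap_blocks.py script, and the block shape and file path below are arbitrary assumptions:

```python
import time

import torch

# Arbitrary block layout for illustration: 256 blocks x 16 tokens x 32 heads
# x head size 128, fp16 (roughly 32 MB in total).
src = torch.randn(256, 16, 32, 128, dtype=torch.float16, device="cuda")
# Pinned host staging buffer, as used for swap-out to the CPU tier.
staging = torch.empty(src.shape, dtype=src.dtype, device="cpu", pin_memory=True)

# GPU -> CPU swap.
torch.cuda.synchronize()
start = time.perf_counter()
staging.copy_(src)
torch.cuda.synchronize()
print(f"GPU->CPU: {time.perf_counter() - start:.6f} s")

# GPU -> file swap: stage through host memory, then write to disk.
torch.cuda.synchronize()
start = time.perf_counter()
staging.copy_(src)
torch.save(staging, "/tmp/kv_swap_block_test.pt")
print(f"GPU->File: {time.perf_counter() - start:.6f} s")
```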
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment /ready on the PR
- Add the ready label to the PR
- Enable auto-merge.
🚀
pin_memory has a great impact on swapping blocks.
More specifically, in benchmarks/kernels/benchmark_swap_blocks.py:
+ from light_vllm.utils import is_pin_memory_available
+ pin_memory = is_pin_memory_available()
- dst = torch.zeros_like(src).cpu()
+ dst = torch.zeros_like(src, pin_memory=pin_memory, device="cpu")
Avg. GPU->CPU time taken for swapping blocks: 0.016023850440979003 seconds. 16 ms looks suspiciously slow.
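To illustrate the point, here is a standalone sketch (not the benchmark script itself) comparing a device-to-host copy into pageable memory with one into a pinned buffer; the tensor shape is arbitrary:

```python
import time

import torch

src = torch.randn(256, 16, 32, 128, dtype=torch.float16, device="cuda")

# Pageable destination, which is what torch.zeros_like(src).cpu() produces.
pageable = torch.empty(src.shape, dtype=src.dtype, device="cpu")
# Page-locked (pinned) destination that the driver can DMA into directly.
pinned = torch.empty(src.shape, dtype=src.dtype, device="cpu", pin_memory=True)

for name, dst in (("pageable", pageable), ("pinned", pinned)):
    torch.cuda.synchronize()
    start = time.perf_counter()
    dst.copy_(src)
    torch.cuda.synchronize()
    print(f"GPU->CPU ({name}): {time.perf_counter() - start:.6f} s")
```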
@noooop Thank you very much for your suggestion. I will make some changes and run the relevant tests. Could you also help me find other reviewers to check whether this solution is feasible?
~~How should I put this tactfully?~~
~~Maybe the external swapper can be used by future async schedulers; it is too slow for the current synchronous scheduler.~~
Adding another benchmark result.
2. Online server benchmark
2.3 Recompute (997 recomputes):
The recompute result is basically the same as the result of swapping to the CPU.
Server:
python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode recompute --tensor-parallel-size 4
Client:
python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000
@DarkLight1337 @ywang96 @youkaichao Hi, I have implemented a simple external storage backend. Please help review the code.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @zeroorhero.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!