[Core][Kernel][Misc] Support external swapper for vllm
Hi,
In the existing version of vLLM, when GPU memory is insufficient, the KV cache has to be swapped to CPU memory. Building on that CPU swapping path, we abstracted an external swapper interface and implemented a local-file backend for storing the KV cache; other distributed storage backends may be added in the future. By adding the external swapper, the storage space available for tokens is greatly expanded.
The specific design is as follows. The KV cache is stored in a hierarchy:
GPU -> CPU -> External Swapper.
Each level offers more capacity but higher latency. Newly generated KV cache is placed in the lowest-latency tier first; when that tier runs out of space, blocks are swapped out to the next tier. We keep the existing vLLM scheduling and executor strategies unchanged, and simply abstract an external swapper interface under the cache engine so that different swapper backends can be plugged in.
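To make the design concrete, here is a minimal sketch of what such an external swapper interface could look like. The class and method names below are illustrative assumptions, not the exact names used in this PR:

```python
from abc import ABC, abstractmethod
from typing import Dict, List

import torch


class ExternalSwapper(ABC):
    """Illustrative interface only: a backend the cache engine can move
    KV-cache blocks to once GPU and CPU space are exhausted."""

    @abstractmethod
    def swap_out(self, kv_cache: torch.Tensor,
                 block_mapping: Dict[int, int]) -> None:
        """Copy the given blocks out of device/host memory into the
        external backend (local file, remote store, ...)."""

    @abstractmethod
    def swap_in(self, kv_cache: torch.Tensor,
                block_mapping: Dict[int, int]) -> None:
        """Copy previously swapped-out blocks back toward the GPU."""

    @abstractmethod
    def free(self, block_ids: List[int]) -> None:
        """Release external space held by blocks that are no longer needed."""
```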
Currently, the local-file external swapper is implemented; an external swapper for Valkey (an RDMA version of Redis) will be implemented next.
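As a rough illustration of the local-file idea, each swapped-out block could simply be serialized to its own file under the configured directory. This is a sketch following the hypothetical interface above, not the code in this PR; the class name and file layout are assumptions:

```python
import os

import torch


class FileSwapper:
    """Hypothetical local-file backend implementing the ExternalSwapper
    interface sketched above: one file per swapped-out KV-cache block."""

    def __init__(self, root: str) -> None:
        # e.g. root parsed from --external-swapper file:///root/test
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, block_id: int) -> str:
        return os.path.join(self.root, f"block_{block_id}.pt")

    def swap_out(self, kv_cache: torch.Tensor, block_mapping: dict) -> None:
        # block_mapping: source GPU block id -> external block id.
        for src, dst in block_mapping.items():
            # Stage the block through CPU memory, then persist it to disk.
            torch.save(kv_cache[src].cpu(), self._path(dst))

    def swap_in(self, kv_cache: torch.Tensor, block_mapping: dict) -> None:
        # block_mapping: external block id -> destination GPU block id.
        for src, dst in block_mapping.items():
            kv_cache[dst].copy_(torch.load(self._path(src)))

    def free(self, block_ids) -> None:
        # Drop files for blocks whose sequences have finished.
        for block_id in block_ids:
            try:
                os.remove(self._path(block_id))
            except FileNotFoundError:
                pass
```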
For the local-file implementation, we also ran some benchmarks. Our test environment is 4x NVIDIA A10 GPUs with an NVMe disk.
1. Kernel benchmark (benchmarks/kernels/benchmark_swap_blocks.py): the execution time of a single kernel increased by about 60% (a measurement sketch follows the benchmark results below).
Avg. GPU->CPU time taken for swapping blocks: 0.016023850440979003 seconds
Avg. GPU->File time taken for swapping blocks: 0.026778271198272707 seconds
Avg. File->GPU time taken for swapping blocks: 0.025919318199157715 seconds
2. Online server benchmark: token throughput dropped only slightly, by about 2%.
2.1 Swap to CPU (997 swaps):
Server:
python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode swap --tensor-parallel-size 4 --swap-space 40
Client:
python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000
2.2 Swap to File (996 swaps):
Server:
python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode swap --swap-space 0 --external-swapper file:///root/test --external-swapper-space 40 --tensor-parallel-size 4
Client:
python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000
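For reference, here is a minimal, self-contained sketch of how GPU->CPU and GPU->File swap times of this kind can be measured. It is not the actual benchmarks/kernels/benchmark_swap_blocks.py script, and the block shape and file path below are arbitrary assumptions:

```python
import time

import torch

# Arbitrary block layout for illustration: 256 blocks x 16 tokens x 32 heads
# x head size 128, fp16 (roughly 32 MB in total).
src = torch.randn(256, 16, 32, 128, dtype=torch.float16, device="cuda")
# Pinned host staging buffer, as used for swap-out to the CPU tier.
staging = torch.empty(src.shape, dtype=src.dtype, device="cpu", pin_memory=True)

# GPU -> CPU swap.
torch.cuda.synchronize()
start = time.perf_counter()
staging.copy_(src)
torch.cuda.synchronize()
print(f"GPU->CPU: {time.perf_counter() - start:.6f} s")

# GPU -> file swap: stage through host memory, then write to disk.
torch.cuda.synchronize()
start = time.perf_counter()
staging.copy_(src)
torch.save(staging, "/tmp/kv_swap_block_test.pt")
print(f"GPU->File: {time.perf_counter() - start:.6f} s")
```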
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment /ready on the PR
- Add the ready label to the PR
- Enable auto-merge.
🚀
pin_memory has a great impact on swapping blocks.
More specifically, in benchmarks/kernels/benchmark_swap_blocks.py:
+ from light_vllm.utils import is_pin_memory_available
+ pin_memory = is_pin_memory_available()
- dst = torch.zeros_like(src).cpu()
+ dst = torch.zeros_like(src, pin_memory=pin_memory, device="cpu")
Avg. GPU->CPU time taken for swapping blocks: 0.016023850440979003 seconds. 16 ms looks suspiciously slow.
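To illustrate the point, here is a standalone sketch (not the benchmark script itself) comparing a device-to-host copy into pageable memory with one into a pinned buffer; the tensor shape is arbitrary:

```python
import time

import torch

src = torch.randn(256, 16, 32, 128, dtype=torch.float16, device="cuda")

# Pageable destination, which is what torch.zeros_like(src).cpu() produces.
pageable = torch.empty(src.shape, dtype=src.dtype, device="cpu")
# Page-locked (pinned) destination that the driver can DMA into directly.
pinned = torch.empty(src.shape, dtype=src.dtype, device="cpu", pin_memory=True)

for name, dst in (("pageable", pageable), ("pinned", pinned)):
    torch.cuda.synchronize()
    start = time.perf_counter()
    dst.copy_(src)
    torch.cuda.synchronize()
    print(f"GPU->CPU ({name}): {time.perf_counter() - start:.6f} s")
```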
@noooop Thank you very much for your suggestion. I will make some changes and run the relevant tests. Could you also help me find other reviewers to check whether this solution is feasible?
~~How should I put this tactfully?~~
~~Maybe the external swapper can be used by future async schedulers; it is too slow for the current synchronous scheduler.~~
Adding another benchmark result.
2. Online server benchmark
2.3 Recompute (997 recomputes):
The recompute result is basically the same as the result of swapping to the CPU.
Server:
python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode recompute --tensor-parallel-size 4
Client:
python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000
@DarkLight1337 @ywang96 @youkaichao Hi, I have implemented a simple external storage backend. Please help review the code.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @zeroorhero.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!