The A100 test performance did not match the official test results
I installed vLLM with
pip install vllm
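In case the exact versions matter for the comparison, they can be printed with a one-liner like this (vllm.__version__, torch.__version__ and torch.version.cuda are standard attributes; this is just for reference, not part of the original run):
python -c "import vllm, torch; print(vllm.__version__, torch.__version__, torch.version.cuda)"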
I then started the server with this command:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.api_server --model llama-7b-hf/ --swap-space 16 --disable-log-requests --port 9009
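Since the issue is about A100 performance, a quick way to double-check which physical GPU CUDA_VISIBLE_DEVICES=7 maps to is the snippet below (with that variable set, the card shows up as device 0 inside the process; again, this is only a reference check, not part of the benchmark):
CUDA_VISIBLE_DEVICES=7 python -c "import torch; print(torch.cuda.get_device_name(0), round(torch.cuda.get_device_properties(0).total_memory / 1e9), 'GB')"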
Then I ran the serving benchmark with:
python benchmark_serving.py --backend vllm --tokenizer ./llama-7b-hf/ --dataset ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 200 --host 127.0.0.1 --port 9009
output:
Namespace(backend='vllm', best_of=1, dataset='ShareGPT_V3_unfiltered_cleaned_split.json', host='127.0.0.1', num_prompts=1000, port=9009, request_rate=200.0, seed=0, tokenizer='/data/zhaohb/llama-7b-hf/', use_beam_search=False)
Token indices sequence length is longer than the specified maximum sequence length for this model (3152 > 2048). Running this sequence through the model will result in indexing errors
Total time: 166.88 s
Throughput: 5.99 requests/s
Average latency: 67.59 s
Average latency per token: 0.21 s
Average latency per output token: 1.10 s
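For what it's worth, these numbers look internally consistent: with --request-rate 200 and the default 1000 prompts, essentially the whole dataset is submitted within the first few seconds, so the reported throughput is just the prompt count divided by the total wall time. A quick check using only values copied from the log above:
num_prompts = 1000        # from the Namespace line above
total_time_s = 166.88     # "Total time: 166.88 s"
print(f"{num_prompts / total_time_s:.2f} requests/s")   # 5.99, matches the reported throughput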
This is far from the official results. How can I fix it? I also ran benchmark_throughput.py, and the performance was likewise very different:
python benchmark_throughput.py --model /data/zhaohb/llama-7b-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', hf_max_batch_size=None, model='/data/zhaohb/llama-7b-hf', n=1, num_prompts=1000, seed=0, tensor_parallel_size=1, use_beam_search=False)
Token indices sequence length is longer than the specified maximum sequence length for this model (3152 > 2048). Running this sequence through the model will result in indexing errors
INFO 07-04 09:54:21 llm_engine.py:59] Initializing an LLM engine with config: model='/data/zhaohb/llama-7b-hf', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 07-04 09:54:21 tokenizer_utils.py:30] Using the LLaMA fast tokenizer in 'hf-internal-testing/llama-tokenizer' to avoid potential protobuf errors.
INFO 07-04 09:54:32 llm_engine.py:128] # GPU blocks: 7438, # CPU blocks: 512
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [02:37<00:00, 6.36it/s]
Throughput: 6.36 requests/s, 3040.51 tokens/s
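The throughput run is also self-consistent: 1000 prompts in about 2 min 37 s gives the reported 6.36 requests/s, and dividing tokens/s by requests/s gives the average request length (my understanding is that benchmark_throughput.py counts prompt plus generated tokens; the per-request figure below is derived from the log, not measured separately):
num_prompts = 1000
elapsed_s = 2 * 60 + 37                                  # progress bar: 1000/1000 in 02:37
print(f"{num_prompts / elapsed_s:.2f} requests/s")       # ~6.37, close to the reported 6.36
print(f"{3040.51 / 6.36:.0f} tokens per request on average")   # ~478 prompt+output tokens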
What's the problem?