[CI/Build] A perplexity-computing test for the FP8 KV cache system. Originally used in the context of PR #3290
The script `benchmarks/measure_ppl2_MC.py` produces a realistic perplexity measurement for the quantized KV cache system by processing a sequence of non-overlapping patches of the reference text. Generation of the consecutive symbols in each patch is governed (forced) by the reference text.
The initial context size for the system is set by the parameter `--context-size`.
The number of output symbols to generate starting from a given context is set by the parameter `--sample-size`; this value also defines the size of each individual patch in tokens.
For an N-token reference text split into M patches with an initial context size of C, the method takes approximately M × (preload time) + (N − C) × (per-token generation time) to complete.
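For concreteness, a minimal sketch of this patch-wise, teacher-forced measurement is shown below. It is not the PR's implementation: it assumes vLLM's `prompt_logprobs` sampling option and the (version-dependent) `prompt_token_ids` argument of `LLM.generate`, and it reuses the model and data paths from the validation commands below.

```python
# Hedged sketch (not the PR's script): patch-wise, teacher-forced perplexity
# using vLLM's prompt_logprobs to score the reference tokens of each patch.
import math

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "/data/models/llama-2-7b-chat-hf"    # path taken from the examples below
DATA = "./vllm/tests/prompts/wiki.test.raw"  # reference text (--data)
CONTEXT_SIZE = 1024                          # --context-size
SAMPLE_SIZE = 512                            # --sample-size == patch size in tokens

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL)

with open(DATA) as f:
    tokens = tokenizer(f.read()).input_ids

# Ask only for the logprob of each actual prompt token; the generated token is discarded.
params = SamplingParams(max_tokens=1, prompt_logprobs=0)

nll, counted = 0.0, 0
# Walk the reference text in non-overlapping patches of SAMPLE_SIZE tokens,
# each conditioned on the CONTEXT_SIZE tokens immediately preceding it.
for start in range(CONTEXT_SIZE, len(tokens) - SAMPLE_SIZE + 1, SAMPLE_SIZE):
    window = tokens[start - CONTEXT_SIZE : start + SAMPLE_SIZE]
    out = llm.generate(prompt_token_ids=[window], sampling_params=params)[0]
    # out.prompt_logprobs[i] covers window[i] given window[:i]; keep only the
    # positions that belong to the current patch.
    for entry, tok in zip(out.prompt_logprobs[-SAMPLE_SIZE:], window[-SAMPLE_SIZE:]):
        lp = entry[tok]
        nll -= lp.logprob if hasattr(lp, "logprob") else lp  # Logprob object or raw float
        counted += 1

print(f"PPL = {math.exp(nll / counted)}")
```

Because the reference tokens themselves are passed as the prompt, the model is scored on exactly the text it is "forced" to generate; averaging the negative logprobs over all patch tokens and exponentiating gives the perplexity.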
Quick correctness validation tips:

- Running the llama-2-7b-chat-hf model with `./vllm/benchmarks/measure_ppl2_MC.py --model=/data/models/llama-2-7b-chat-hf --data=./vllm/tests/prompts/wiki.test.raw --context-size=1024 --sample-size=512` should result in PPL ~ 6.524227946419175.
- Running the llama-2-7b-chat-hf model with `./vllm/benchmarks/measure_ppl2_MC.py --model=/data/models/llama-2-7b-chat-hf --data=./vllm/tests/prompts/wiki.test.raw --context-size=1024 --sample-size=512 --patch-size=1` should result in PPL ~ 3.8968611189957523.
This testing method is sensitive to the representation precision of the KV cache. The table below presents the perplexities achieved with different quantization and scaling methods.
llama-2-7b-chat-hf, 647 patches, 330,849 symbols (max), generation 512 × initial context 1024:

| KV cache representation | PPLv2 |
|---|---|
| FP8, scaling 1e3 | 2016.919661 |
| FP8, scaling 1e2 | 7.110797102 |
| FP8, scaling 1e1 | 6.550152394 |
| FP16 | 6.524227946 |
| FP8 | 6.541197624 |
| FP8, scaling 1e-1 | 6.545720813 |
| FP8, scaling 1e-2 | 57.70005660 |
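As a rough illustration of how the FP8 rows differ from the FP16 baseline at the engine level, the following sketch assumes vLLM's `kv_cache_dtype` engine argument; the explicit per-tensor scaling factors behind the "scaling 1eN" rows are outside its scope.

```python
# Hedged sketch: loading the same checkpoint with an FP8 KV cache instead of
# the default (higher-precision) one. kv_cache_dtype is assumed here; the
# explicit scaling factors from the table are not covered by this snippet.
from vllm import LLM

llm = LLM(
    model="/data/models/llama-2-7b-chat-hf",  # same checkpoint as above
    kv_cache_dtype="fp8",                     # store K/V activations in 8-bit floating point
)
```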
Hi @Alexei-V-Ivanov-AMD, this is a nice script to have at hand. Other packages like llama.cpp run perplexity tests in their CI, which I think the vLLM maintainers should consider adopting to avoid regressions.
cc @simon-mo can we review and get this PR in? It'll help unblock the AMD team on adding more tests. Thanks!
Sounds good. I agree with @casper-hansen that this is very valuable and a good start for #3780
At a high level, I would imagine that running a more end-to-end test like https://github.com/EleutherAI/lm-evaluation-harness, which directly supports vLLM with a simpler command, would be better?
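For reference, a minimal lm-evaluation-harness run against vLLM might look like the sketch below; the task choice and model path are assumptions, not part of this PR.

```python
# Hedged sketch: measuring perplexity via lm-evaluation-harness's vLLM backend
# instead of a custom script. Task name and model path are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=/data/models/llama-2-7b-chat-hf,dtype=auto",
    tasks=["wikitext"],  # reports word-level perplexity on WikiText-2
)
print(results["results"]["wikitext"])
```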
For actual testing I would prefer using lm-eval. As for this script, I think it has value and could be put into the `examples` folder?
Agreed. Moving the script into the `examples` folder. Thank you!
Please assign a reviewer to look at this. I think this is a valuable PR that enables simple PPL benchmarks. @WoosukKwon @simon-mo