
[CI/Build] A perplexity-computing test for the FP8 KV cache system. Originally used in the context of PR #3290

Open · Alexei-V-Ivanov-AMD opened this issue on Mar 29, 2024 · 6 comments

The script benchmarks/measure_pplv2_MC.py produces a realistic perplexity measurement for the quantized KV cache system by processing a sequence of non-overlapping patches of the reference text. Generation of the consecutive symbols in each patch is governed (forced) by the reference text.

The initial context size for the system is set by the parameter "--context-size".

The number of output symbols to generate from a given context is set by the parameter "--sample-size". This parameter also defines the size of an individual patch: the patch size in tokens equals the sample size.

For an N-token reference text split into M patches with an initial context size of C, the method takes roughly M * (preload time) + (N - C) * (per-token generation time) to complete.
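To make the scheme concrete, here is a minimal sketch of a patch-wise, teacher-forced perplexity computation. It is not the script from this PR: it uses HuggingFace transformers purely to illustrate the measurement, and the function name and defaults are assumptions.

```python
# Minimal sketch of the patch-wise, teacher-forced PPL scheme described above.
# NOT the script from this PR; uses HuggingFace transformers for illustration,
# and all names/defaults here are assumptions.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def patchwise_ppl(model_path, text_path, context_size=1024, sample_size=512, device="cuda"):
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16
    ).to(device).eval()

    ids = tok(open(text_path).read(), return_tensors="pt").input_ids[0]
    nll_sum, n_scored = 0.0, 0

    # Walk the reference text in non-overlapping patches of `sample_size` tokens,
    # each conditioned on the preceding `context_size` tokens (teacher forcing).
    for start in range(context_size, ids.numel() - sample_size + 1, sample_size):
        window = ids[start - context_size : start + sample_size].unsqueeze(0).to(device)
        labels = window.clone()
        labels[:, :context_size] = -100  # score only the patch tokens
        with torch.no_grad():
            mean_nll = model(window, labels=labels).loss  # mean NLL over scored tokens
        nll_sum += mean_nll.item() * sample_size
        n_scored += sample_size

    return math.exp(nll_sum / n_scored)
```

The actual script additionally drives the quantized (FP8) KV cache path, which this sketch does not exercise.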

Quick correctness validation tips:

- Running the llama-2-7b-chat-hf model (`./vllm/benchmarks/measure_ppl2_MC.py --model=/data/models/llama-2-7b-chat-hf --data=./vllm/tests/prompts/wiki.test.raw --context-size=1024 --sample-size=512`) should result in PPL ~ 6.524227946419175

- Running the llama-2-7b-chat-hf model (`./vllm/benchmarks/measure_ppl2_MC.py --model=/data/models/llama-2-7b-chat-hf --data=./vllm/tests/prompts/wiki.test.raw --context-size=1024 --sample-size=512 --patch-size=1`) should result in PPL ~ 3.8968611189957523

This testing method is sensitive to the representation precision of the KV cache. The table below presents perplexities achieved with different quantization and scaling methods.

llama-2-7b-chat-hf, 647 patches, 330849 symbols (max), gen 512 × init 1024:

| KV cache format | PPLv2 |
|---|---|
| FP8, scaling 1e3 | 2016.919661 |
| FP8, scaling 1e2 | 7.110797102 |
| FP8, scaling 1e1 | 6.550152394 |
| FP16 | 6.524227946 |
| FP8 | 6.541197624 |
| FP8, scaling 1e-1 | 6.545720813 |
| FP8, scaling 1e-2 | 57.70005660 |
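For intuition about the scaling rows: with a static KV-cache scale, a common convention is to divide values by the scale before the FP8 (e4m3) cast and multiply back on read, so a badly chosen scale pushes activations into the underflow or saturation region of the format. The toy round-trip measurement below illustrates this; it is not vLLM code, it assumes PyTorch >= 2.1 for `torch.float8_e4m3fn`, and the scaling direction is an assumption.

```python
# Toy round-trip experiment (not vLLM code) showing how the choice of FP8 scale
# moves values into the underflow or saturation region of e4m3.
# Assumes PyTorch >= 2.1 for torch.float8_e4m3fn; the "divide on write, multiply
# on read" convention is an assumption about the scaling direction.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def fp8_roundtrip_error(x: torch.Tensor, scale: float) -> float:
    q = (x / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)  # quantize
    deq = q.to(torch.float32) * scale                                   # dequantize
    return (x - deq).abs().mean().item()

# Stand-in for KV-cache activations; real distributions are model-dependent.
x = torch.randn(1 << 16) * 2.0
for scale in (1e3, 1e2, 1e1, 1.0, 1e-1, 1e-2):
    print(f"scale={scale:>6g}  mean abs round-trip error={fp8_roundtrip_error(x, scale):.6f}")
```

Large scales lose small values to underflow while very small scales clip outliers, which is consistent with the degradation pattern in the table above.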

Alexei-V-Ivanov-AMD · Mar 29, 2024

Hi @Alexei-V-Ivanov-AMD, this is a nice script to have at hand. Other packages like llama.cpp run perplexity tests in their CI, which I think the vLLM maintainers should consider adopting to avoid regressions.

casper-hansen · Mar 30, 2024

cc @simon-mo, can we review and get this PR in? It'll help unblock the AMD team on adding more tests. Thanks!

sunway513 · Apr 2, 2024

Sounds good. I agree with @casper-hansen that this is very valuable and a good start for #3780

simon-mo · Apr 4, 2024

At a high level, I would imagine that running more end-to-end tests with something like https://github.com/EleutherAI/lm-evaluation-harness, which can directly support vLLM with a simpler command, would be better.

For actual testing I would prefer using lm-eval. As for this script, I think it has value and could be put into the examples folder.
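For reference, a perplexity run through lm-evaluation-harness against a vLLM backend might look roughly like the sketch below; the `simple_evaluate` entry point, the registered "vllm" model type, and the task choice are assumptions based on the lm-eval 0.4.x-style API, not part of this PR.

```python
# Rough sketch of a perplexity evaluation through lm-evaluation-harness with a
# vLLM backend. The simple_evaluate signature, the "vllm" model type, and the
# task name below are assumptions (lm-eval 0.4.x-style API), not vLLM code.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=/data/models/llama-2-7b-chat-hf,dtype=float16",
    tasks=["wikitext"],  # word/byte perplexity on WikiText
)
print(results["results"]["wikitext"])
```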

simon-mo · Apr 4, 2024

> For actual testing I would prefer using lm-eval. As for this script, I think it has value and could be put into the examples folder.

Agreed. Moving the script into the 'examples' folder. Thank you!

Alexei-V-Ivanov-AMD · Apr 5, 2024

Please assign a reviewer to look at this. I think this is a valuable PR that enables simple PPL benchmarks. @WoosukKwon @simon-mo

Qubitium · Apr 23, 2024