[Question]: Reproduce end2end latency results of LLMLingua-2
Describe the issue
@pzs19
I would like to reproduce and expand the end2end latency benchmark results of the LLMLingua-2 paper and was therefore wondering if you could provide more details on your experiment setup? Specifically:
- Which target LLM was evaluated (and how was it set up, was vLLM or similar used?)
- For the result in Table 5, which prompt length was used, what was the prompt?
- What's the definition of end2end latency? From the beginning of compression until the generation of the first token, or until the full response is generated?
- What was `max_tokens` set to, and did you enforce the generation of a minimum number of tokens?
Thanks a lot!
Thank you for raising the questions. Here is a point-by-point response:
- The target LLM is GPT-3.5-Turbo-0613, so vLLM is not used.
- The latency experiment is conducted on the summarization task of MeetingBank, and the prompt follows the setup of the main experiment.
- End2end latency counts from the beginning of compression until the full response is generated.
- We set `max_tokens` to 400, following the main experiment.
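
For anyone reproducing this, below is a minimal sketch of how one might measure end2end latency under these settings. It assumes the public `llmlingua` package and the `openai` Python client; the compressor model name, compression rate, and the prompt contents are placeholders, not necessarily the exact values used in the paper.

```python
import time

from llmlingua import PromptCompressor
from openai import OpenAI

# Assumed setup: the LLMLingua-2 compressor released with the paper's repo
# and the standard OpenAI client. Adjust the model names as needed;
# GPT-3.5-Turbo-0613 may no longer be available via the API.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
client = OpenAI()


def end2end_latency(context: str, question: str) -> float:
    """Time from the start of compression until the full response is returned."""
    start = time.time()

    # Step 1: compress the original prompt (rate is illustrative).
    compressed = compressor.compress_prompt(context, rate=0.33)

    # Step 2: query the target LLM with the compressed prompt.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        messages=[
            {"role": "user", "content": compressed["compressed_prompt"] + "\n\n" + question}
        ],
        max_tokens=400,
    )

    # End2end latency covers compression plus full response generation.
    _ = response.choices[0].message.content
    return time.time() - start
```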
Thank you very much! 🙂
@pzs19 @iofu728 sorry, a follow-up question: which LLM was used for compression in the end-to-end latency benchmark of the original LLMLingua paper? Under "Implementation Details" it says:
> In our experiments, we utilize either Alpaca-7B or GPT2-Alpaca as the small pre-trained language model M_s for compression.
however, as far as I can see, it is not specified which of those two models was used for the end-to-end latency benchmark. It is also not specified which compressor was used for the other benchmarks (gsm8k etc.), so that would be another question.