[Question]: Reproduce end2end latency results of LLMLingua-2
Describe the issue
@pzs19
I would like to reproduce and expand the end2end latency benchmark results of the LLMLingua-2 paper and was therefore wondering if you could provide more details on your experiment setup? Specifically:
- Which target LLM was evaluated (and how was it set up, was vLLM or similar used?)
- For the result in Table 5, which prompt length was used, what was the prompt?
- What's the definition of end2end latency? From the beginning of compression until the generation of the first token, or until the full response is generated?
- What was `max_tokens` set to, and did you enforce the generation of a minimum number of tokens?
Thanks a lot!
Thank you for raising the questions. Here is a point-by-point response:
- The target LLM is GPT-3.5-Turbo-0613, so vLLM is not used.
- The latency experiment is conducted on the summarization task of MeetingBank, and the prompt follows the setup of the main experiment.
- End2end latency counts from the beginning of compression until the full response is generated.
- We set `max_tokens` to 400, following the main experiment.
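
For anyone reproducing this, below is a minimal sketch of how one might measure end2end latency under these settings. It assumes the public `llmlingua` package and the `openai` Python client; the compressor model name, compression rate, and the prompt contents are placeholders, not necessarily the exact values used in the paper.

```python
import time

from llmlingua import PromptCompressor
from openai import OpenAI

# Assumed setup: the LLMLingua-2 compressor released with the paper's repo
# and the standard OpenAI client. Adjust the model names as needed;
# GPT-3.5-Turbo-0613 may no longer be available via the API.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
client = OpenAI()


def end2end_latency(context: str, question: str) -> float:
    """Time from the start of compression until the full response is returned."""
    start = time.time()

    # Step 1: compress the original prompt (rate is illustrative).
    compressed = compressor.compress_prompt(context, rate=0.33)

    # Step 2: query the target LLM with the compressed prompt.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        messages=[
            {"role": "user", "content": compressed["compressed_prompt"] + "\n\n" + question}
        ],
        max_tokens=400,
    )

    # End2end latency covers compression plus full response generation.
    _ = response.choices[0].message.content
    return time.time() - start
```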
Thank you very much! 🙂
@pzs19 @iofu728 sorry, a follow-up question: which LLM was used for compression in the end-to-end latency benchmark of the original LLMLingua paper? Under "Implementation Details" it says:
> In our experiments, we utilize either Alpaca-7B or GPT2-Alpaca as the small pre-trained language model M_s for compression.
however, as far as I can see, it is not specified which of those two models was used for the end-to-end latency benchmark. It is also not specified which compressor was used for the other benchmarks (gsm8k etc.), so that would be another question.