[Question]: Reproduce LLMLingua-2 on the LongBench SingleDoc dataset

Open · 56wangyun opened this issue on May 07, 2024 · 2 comments

Describe the issue

We followed your evaluation code:

- https://github.com/microsoft/LLMLingua/blob/main/experiments/llmlingua2/evaluation/compress.py
- https://github.com/microsoft/LLMLingua/blob/main/experiments/llmlingua2/evaluation/eval_longbench.py

Experiment settings:

- target token: 2000
- compression model: llmlingua-2-bert-base-multilingual-cased-meetingbank
- target LLM: Mistral-7B-Instruct-v0.1 (from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/tree/main)
- LongBench SingleDoc tasks: qasper, multifieldqa_en, narrativeqa
- hardware platform: 1 NVIDIA A100-80GB
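
For reference, a minimal sketch of how these settings map onto the llmlingua `PromptCompressor` API (the repo's compress.py adds chunking and post-processing on top of this, and the `force_tokens` list below is taken from the README example rather than from this exact run):

```python
# Sketch only: compress a LongBench context to ~2000 tokens with the
# LLMLingua-2 small (multilingual BERT) compressor.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # enable the LLMLingua-2 token-classification compressor
)

context = "..."  # long document text from a LongBench sample

result = compressor.compress_prompt(
    context,
    target_token=2000,         # the 2000-token constraint from Table 4
    force_tokens=["\n", "?"],  # always-kept tokens; this list is an assumption
)
compressed_prompt = result["compressed_prompt"]
```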

The result differs from the paper (Table 4, LLMLingua-2-small, LongBench SingleDoc, 2000-token constraint). Our compressed-prompt evaluation scores are: {'qasper': 32.27, 'multifieldqa_en': 33.04, 'narrativeqa': 8.84}, average 24.7 (25.3 in the paper).

The uncompressed (original) prompt evaluation scores are: {"multifieldqa_en": 37.07, "qasper": 33.83, "narrativeqa": 19.89}, average 30.3 (24.5 in the paper).

What are the experiment settings used in the paper, and what might explain the difference in evaluation results? Thank you for your reply.

56wangyun · May 07 '24

Hi @56wangyun, thanks for your support with LLMLingua-2.

In general, you should be able to reproduce the results in Table 4 by following the steps in eval_longbench.py and compress.py. Could you provide more details, including the codebase used for Mistral inference and the runtime environment (package versions, etc.)? This information would help ensure accurate replication of the results.

iofu728 · May 10 '24

Hi @56wangyun, thanks for providing the detailed information.

I believe the difference in results may indeed stem from the use of different Mistral models. As mentioned in the "Mistral-7B as the Target LLM" part of the Experiments section, we used the "mistral-7B-v0.1" base model (available at https://github.com/mistralai/mistral-src), not the "mistral-7B-instruct-v0.1" model, as the target LLM. I hope this information helps in replicating the results.
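
For anyone hitting the same gap, here is a minimal sketch of swapping in the base checkpoint. This is illustrative only, assuming a standard Hugging Face transformers setup rather than the mistral-src reference implementation used for the paper:

```python
# Illustrative sketch: evaluate with the *base* Mistral-7B-v0.1 (the paper's
# target LLM) instead of the instruct-tuned variant, decoding greedily.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # base model, per the paper

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

compressed_prompt = "..."  # output of compress.py for one LongBench sample

inputs = tokenizer(compressed_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
answer = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```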

pzs19 · May 10 '24