LLMLingua
[Question]: Reproduce LLMLingua-2 results with Mistral-7B
Describe the issue
First of all, thank you for your great contributions.
I have a question similar to issue #146: I cannot reproduce the Table 4 results from the LLMLingua-2 paper.
- Compression model: microsoft/llmlingua-2-xlm-roberta-large-meetingbank (downloaded from HF)
- LLM: mistralai/Mistral-7B-v0.1 (also downloaded from HF, not an instruction-tuned model)
- Hardware platform: 1x NVIDIA A100-80GB
Here are some results from the paper and my reproduced scores:
| | MeetingBank QA | MeetingBank Summary | LongBench single-doc avg. (2000 tokens) | narrativeqa | multifieldqa_en | multifieldqa_zh | qasper |
|---|---|---|---|---|---|---|---|
| LLMLingua-2 (paper) | 76.22 | 30.18 | 26.8 | | | | |
| Original prompt (paper) | 66.95 | 26.26 | 24.5 | | | | |
| LLMLingua-2 (reproduced) | 73.59 | 29.95 | 25.65 | 10.07 | 36.61 | 26.47 | 29.46 |
| Original prompt (reproduced) | 66.05 | 26.89 | 26.47 | 10.05 | 38.7 | 31.46 | 25.67 |
I'm not sure whether multifieldqa_zh should be included when computing the average of the LongBench single-doc QA scores, but even excluding it the average is still inconsistent with the paper.
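To make the averaging explicit, here is the small Python check I used on the reproduced per-task scores from the table above (it only reproduces the 25.65 figure and the variant without multifieldqa_zh):

```python
# Per-task LongBench single-doc scores from the "LLMLingua-2 (reproduced)" row above.
scores = {
    "narrativeqa": 10.07,
    "multifieldqa_en": 36.61,
    "multifieldqa_zh": 26.47,
    "qasper": 29.46,
}

avg_with_zh = sum(scores.values()) / len(scores)  # 25.65, the value in the table
avg_without_zh = (scores["narrativeqa"] + scores["multifieldqa_en"]
                  + scores["qasper"]) / 3         # 25.38

print(f"incl. multifieldqa_zh: {avg_with_zh:.2f}")
print(f"excl. multifieldqa_zh: {avg_without_zh:.2f}")
# Either way, the average stays below the 26.8 reported in the paper.
```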
As an example, here is the process I followed for the MeetingBank QA evaluation.
- I made meetingbank_test_3qa_pairs_summary_formated.json by modifying format_data.py.
- Generated compressed_prompt with the command below (a sketch of the underlying PromptCompressor call is included after this list):

```bash
python compress.py --load_origin_from ../../../results/meetingbank/origin/meetingbank_test_3qa_pairs_summary_formated.json \
    --model_name microsoft/llmlingua-2-xlm-roberta-large-meetingbank \
    --compression_rate 0.33 \
    --force_tokens "\n,?,!,." \
    --save_path ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
```
- Evaluated with:

```bash
python eval_meetingbank_qa_local_llm.py --load_prompt_from ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json \
    --load_key compressed_prompt \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --save_path ../../../results/meetingbank/llmlingua2/mistral_7b/answer_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
```
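For reference, this is a minimal sketch of what I understand the compress.py step to be doing through the llmlingua API; the actual script's argument handling may differ, and `original_prompt` is just a placeholder for one transcript loaded from the formatted JSON:

```python
from llmlingua import PromptCompressor

# LLMLingua-2 compressor backed by the MeetingBank-finetuned xlm-roberta-large model.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

original_prompt = "..."  # one meeting transcript from meetingbank_test_3qa_pairs_summary_formated.json

result = compressor.compress_prompt(
    original_prompt,
    rate=0.33,                           # keep roughly one third of the tokens
    force_tokens=["\n", "?", "!", "."],  # tokens that are always kept
)
compressed_prompt = result["compressed_prompt"]
```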
I modified eval_meetingbank_qa.py into eval_meetingbank_qa_local_llm.py so that it uses vLLM with the local HF Mistral-7B model. If there is no problem with my reproduction process, would it be possible to share the code you used for evaluation with Mistral-7B? Thank you for reading.
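For completeness, the core of my change is roughly the following (simplified; the real script keeps the QA prompt template from eval_meetingbank_qa.py and only swaps the OpenAI call for vLLM, so function and variable names here are illustrative):

```python
from vllm import LLM, SamplingParams

# Local base Mistral-7B (not instruction-tuned) served through vLLM on one A100-80GB.
llm = LLM(model="mistralai/Mistral-7B-v0.1", dtype="bfloat16")
sampling = SamplingParams(temperature=0.0, max_tokens=128)

def answer(compressed_prompt: str, question: str) -> str:
    # Simplified QA template; the actual template follows eval_meetingbank_qa.py.
    prompt = f"{compressed_prompt}\n\nQuestion: {question}\nAnswer:"
    output = llm.generate([prompt], sampling)[0]
    return output.outputs[0].text.strip()
```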