
[Question]: Reproduce LLMLingua-2 results with Mistral-7B

xvyaward opened this issue 9 months ago · 4 comments

Describe the issue

First of all, thank you for your great contributions.

I have a question similar to issue #146: I cannot reproduce the Table 4 results from the LLMLingua-2 paper.

- Compression model: microsoft/llmlingua-2-xlm-roberta-large-meetingbank (downloaded from HF)
- LLM: mistralai/Mistral-7B-v0.1 (also downloaded from HF; not an instruction-tuned model)
- Hardware platform: 1× NVIDIA A100-80GB

Here are some results from the paper and my reproduced scores:

|  | MeetingBank QA | MeetingBank summary (2000 tokens) | LongBench avg. (2000 tokens) | narrativeqa | multifieldqa_en | multifieldqa_zh | qasper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLMLingua-2 (paper) | 76.22 | 30.18 | 26.8 |  |  |  |  |
| Original prompt (paper) | 66.95 | 26.26 | 24.5 |  |  |  |  |
| LLMLingua-2 (reproduced) | 73.59 | 29.95 | 25.65 | 10.07 | 36.61 | 26.47 | 29.46 |
| Original prompt (reproduced) | 66.05 | 26.89 | 26.47 | 10.05 | 38.7 | 31.46 | 25.67 |

I'm not sure whether multifieldqa_zh should be included when computing the LongBench single-doc QA average, but even after excluding it the average still doesn't match the paper.
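For concreteness, the averages can be recomputed with a few lines of Python (scores copied from the table above, with and without multifieldqa_zh):

```python
# Reproduced LongBench single-doc QA scores from the table above.
llmlingua2 = {"narrativeqa": 10.07, "multifieldqa_en": 36.61,
              "multifieldqa_zh": 26.47, "qasper": 29.46}
original = {"narrativeqa": 10.05, "multifieldqa_en": 38.70,
            "multifieldqa_zh": 31.46, "qasper": 25.67}

def avg(scores, exclude=()):
    kept = [v for k, v in scores.items() if k not in exclude]
    return sum(kept) / len(kept)

print(avg(llmlingua2))  # 25.6525 -> the 25.65 reported above
print(avg(original))    # 26.47   -> the 26.47 reported above
print(avg(llmlingua2, exclude=("multifieldqa_zh",)))  # 25.38
print(avg(original, exclude=("multifieldqa_zh",)))    # about 24.81
```

Either way, the reproduced averages stay well away from the paper's 26.8 vs. 24.5 gap.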

Here is the process I followed for the MeetingBank QA evaluation.

1. Made `meetingbank_test_3qa_pairs_summary_formated.json` by modifying `format_data.py`.
2. Generated the `compressed_prompt` field with the command below (a library-level sketch of this step follows the list):

   ```bash
   python compress.py --load_origin_from ../../../results/meetingbank/origin/meetingbank_test_3qa_pairs_summary_formated.json \
       --model_name microsoft/llmlingua-2-xlm-roberta-large-meetingbank \
       --compression_rate 0.33 \
       --force_tokens "\n,?,!,." \
       --save_path ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
   ```

3. Evaluated with:

   ```bash
   python eval_meetingbank_qa_local_llm.py --load_prompt_from ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json \
       --load_key compressed_prompt \
       --model_name_or_path mistralai/Mistral-7B-v0.1 \
       --save_path ../../../results/meetingbank/llmlingua2/mistral_7b/answer_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
   ```
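As referenced in step 2, here is a minimal sketch of what that compression step does at the library level, using the public `PromptCompressor` API from the llmlingua package (the transcript string is a placeholder, and I'm assuming the script's flags map onto these arguments):

```python
from llmlingua import PromptCompressor

# Load the LLMLingua-2 token-classification compressor (same checkpoint as step 2).
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

# Placeholder transcript; in the pipeline this comes from the formatted JSON.
original_prompt = "Item 15: report from the City Manager, recommendation to adopt ..."

# Keep ~33% of the tokens and force newlines/punctuation to survive,
# mirroring --compression_rate 0.33 and --force_tokens "\n,?,!,."
result = compressor.compress_prompt(
    original_prompt,
    rate=0.33,
    force_tokens=["\n", "?", "!", "."],
)
print(result["compressed_prompt"])
```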

I modified eval_meetingbank_qa.py into eval_meetingbank_qa_local_llm.py so that it uses vLLM with a local HF Mistral-7B model; a rough sketch of what that script does is included below. If there is no problem with the reproduction process, would it be possible to share the code you used for evaluation with Mistral-7B? Thank you for reading.
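In case it helps, this is roughly the generation step I mean (a simplified sketch, not my exact script; the path, the JSON field names `compressed_prompt`/`question`, and the prompt template are placeholders):

```python
import json
from vllm import LLM, SamplingParams

# Placeholder path: the compressed-prompt JSON saved by compress.py in step 2.
with open("compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json") as f:
    samples = json.load(f)

# Serve the local HF Mistral-7B checkpoint through vLLM.
llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.0, max_tokens=128)

# Simplified QA template; the real script iterates the 3 QA pairs per meeting
# and post-processes answers before scoring. Field names are assumptions.
prompts = [
    f"{s['compressed_prompt']}\n\nQuestion: {s['question']}\nAnswer:"
    for s in samples
]

outputs = llm.generate(prompts, params)
answers = [o.outputs[0].text.strip() for o in outputs]
```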

xvyaward · May 21 '24 15:05