LLMLingua
[Question]: Reproduce LLMLingua-2 results with Mistral-7B
Describe the issue
First of all, thank you for your great contributions.
I have a question similar to issue #146: I cannot reproduce the Table 4 results from the LLMLingua-2 paper.
- Compression model: microsoft/llmlingua-2-xlm-roberta-large-meetingbank (downloaded from HF)
- LLM: mistralai/Mistral-7B-v0.1 (also downloaded from HF, not an instruction-tuned model)
- Hardware platform: 1x NVIDIA A100-80GB
Here are some results from the paper and my reproduced scores:
| | MeetingBank QA | MeetingBank Summary | LongBench single-doc avg. (2000 tokens) | narrativeqa | multifieldqa_en | multifieldqa_zh | qasper |
|---|---|---|---|---|---|---|---|
| LLMLingua-2 (paper) | 76.22 | 30.18 | 26.8 | | | | |
| Original prompt (paper) | 66.95 | 26.26 | 24.5 | | | | |
| LLMLingua-2 (reproduced) | 73.59 | 29.95 | 25.65 | 10.07 | 36.61 | 26.47 | 29.46 |
| Original prompt (reproduced) | 66.05 | 26.89 | 26.47 | 10.05 | 38.7 | 31.46 | 25.67 |
I'm not sure whether multifieldqa_zh should be included when computing the average of the LongBench single-doc QA scores, but even excluding it the average is still inconsistent with the paper.
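To make the averaging explicit, here is the small Python check I used on the reproduced per-task scores from the table above (it only reproduces the 25.65 figure and the variant without multifieldqa_zh):

```python
# Per-task LongBench single-doc scores from the "LLMLingua-2 (reproduced)" row above.
scores = {
    "narrativeqa": 10.07,
    "multifieldqa_en": 36.61,
    "multifieldqa_zh": 26.47,
    "qasper": 29.46,
}

avg_with_zh = sum(scores.values()) / len(scores)  # 25.65, the value in the table
avg_without_zh = (scores["narrativeqa"] + scores["multifieldqa_en"]
                  + scores["qasper"]) / 3         # 25.38

print(f"incl. multifieldqa_zh: {avg_with_zh:.2f}")
print(f"excl. multifieldqa_zh: {avg_without_zh:.2f}")
# Either way, the average stays below the 26.8 reported in the paper.
```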
As an example, here is the process I followed for the MeetingBank QA evaluation.
- I made meetingbank_test_3qa_pairs_summary_formated.json by modifying format_data.py.
- Generated compressed_prompt with the command below (a sketch of the underlying PromptCompressor call is included after this list):

```bash
python compress.py --load_origin_from ../../../results/meetingbank/origin/meetingbank_test_3qa_pairs_summary_formated.json \
    --model_name microsoft/llmlingua-2-xlm-roberta-large-meetingbank \
    --compression_rate 0.33 \
    --force_tokens "\n,?,!,." \
    --save_path ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
```
- Evaluated with:

```bash
python eval_meetingbank_qa_local_llm.py --load_prompt_from ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json \
    --load_key compressed_prompt \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --save_path ../../../results/meetingbank/llmlingua2/mistral_7b/answer_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
```
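For reference, this is a minimal sketch of what I understand the compress.py step to be doing through the llmlingua API; the actual script's argument handling may differ, and `original_prompt` is just a placeholder for one transcript loaded from the formatted JSON:

```python
from llmlingua import PromptCompressor

# LLMLingua-2 compressor backed by the MeetingBank-finetuned xlm-roberta-large model.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

original_prompt = "..."  # one meeting transcript from meetingbank_test_3qa_pairs_summary_formated.json

result = compressor.compress_prompt(
    original_prompt,
    rate=0.33,                           # keep roughly one third of the tokens
    force_tokens=["\n", "?", "!", "."],  # tokens that are always kept
)
compressed_prompt = result["compressed_prompt"]
```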
I modified eval_meetingbank_qa.py into eval_meetingbank_qa_local_llm.py so that it uses vLLM with the local HF Mistral-7B model. If there is no problem with my reproduction process, would it be possible to share the code you used for evaluation with Mistral-7B? Thank you for reading.
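For completeness, the core of my change is roughly the following (simplified; the real script keeps the QA prompt template from eval_meetingbank_qa.py and only swaps the OpenAI call for vLLM, so function and variable names here are illustrative):

```python
from vllm import LLM, SamplingParams

# Local base Mistral-7B (not instruction-tuned) served through vLLM on one A100-80GB.
llm = LLM(model="mistralai/Mistral-7B-v0.1", dtype="bfloat16")
sampling = SamplingParams(temperature=0.0, max_tokens=128)

def answer(compressed_prompt: str, question: str) -> str:
    # Simplified QA template; the actual template follows eval_meetingbank_qa.py.
    prompt = f"{compressed_prompt}\n\nQuestion: {question}\nAnswer:"
    output = llm.generate([prompt], sampling)[0]
    return output.outputs[0].text.strip()
```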