[Bug] Long text evaluation parameters are not clear
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
Python 3.10.1, OpenCompass 0.2.3, vllm 0.2.3
Reproduces the problem - code/configuration sample
configs/models/chatglm/vllm_chatglm2_6b_32k.py:

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='chatglm2-6b-32k-vllm',
        path='THUDM/chatglm2-6b-32k',
        max_out_len=512,
        max_seq_len=4096,
        batch_size=32,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
Reproduces the problem - command or script
python run.py --model vllm_chatglm2_6b_32k --datasets longbench leval
Reproduces the problem - error message
My evaluation results differ from the documented long-text evaluation scores by about 20 points, and the documented scores cannot be reproduced.
- Should the `max_seq_len` and `max_out_len` parameters be modified in any way?
Other information
No response
For optimal performance, it is advisable to configure the max_seq_len parameter to the highest value feasible, such as 32768 or even higher if possible. As for max_out_len, it typically comes with a preset default value within the dataset configuration; you can adjust it to 256, or simply retain the default setting.
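For reference, a minimal sketch of the adjusted config following this advice, assuming the same VLLM model entry posted above (the 32768 value is only the suggestion from this thread, not an official recommendation):

```python
from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='chatglm2-6b-32k-vllm',
        path='THUDM/chatglm2-6b-32k',
        max_seq_len=32768,   # raised from 4096 so long-text inputs are not truncated
        max_out_len=512,     # optional: set to 256, or drop to use the dataset config's default
        batch_size=32,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```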
Thank you very much. I reproduced most of the scores.
I also need to ask: for the subsets scored with rouge1, rouge2, rougeL, and rougeLsum, the score differences are still very large.
- What could be the reason?
- Which metrics are used in the leaderboard ranking?
@liushz