Baichuan-7B
lm-evaluation-harness Chinese task evaluation results, compared with WizardLM [Question]
Required prerequisites
- [X] I have read the documentation https://github.com/baichuan-inc/baichuan-7B/blob/HEAD/README.md.
- [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [X] Consider asking first in a Discussion.
Questions
Thanks to the Baichuan team for their contribution. To gauge baichuan-7B's Chinese ability, I picked the Chinese tasks from lm-evaluation-harness: xwinograd_zh, xnli_zh, xcopa_zh, xstory_cloze_zh, and mgsm_zh. Of these, xwinograd_zh, xnli_zh, xcopa_zh, and xstory_cloze_zh lean toward reasoning, while mgsm_zh leans toward math. I ran two evaluations, one with num_fewshot=0 and one with num_fewshot=5. One caveat: lm-evaluation-harness does not pass trust_remote_code to the tokenizer by default, so I had to apply a small hack to get it running (a sketch is included below); everything else was left unchanged.
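For reference, the workaround amounts to making sure trust_remote_code=True reaches the tokenizer loader inside the harness. The sketch below is a monkey-patch that has the same effect as my local edit, not necessarily the exact lines I changed; the reproduction command in the comment assumes the main.py CLI of the lm-evaluation-harness version I used.

```python
# Roughly how the harness was invoked (flags per the main.py CLI of the
# lm-evaluation-harness version used here):
#   python main.py --model hf-causal-experimental \
#       --model_args pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True \
#       --tasks xwinograd_zh,xnli_zh,xcopa_zh,xstory_cloze_zh,mgsm_zh \
#       --num_fewshot 0

# Minimal sketch of the tokenizer hack: force trust_remote_code=True on every
# AutoTokenizer.from_pretrained call, since baichuan-7B ships a custom tokenizer.
# This must run before the harness loads the tokenizer (e.g. at the top of main.py).
import transformers

_orig_from_pretrained = transformers.AutoTokenizer.from_pretrained

def _patched_from_pretrained(*args, **kwargs):
    kwargs.setdefault("trust_remote_code", True)
    return _orig_from_pretrained(*args, **kwargs)

transformers.AutoTokenizer.from_pretrained = _patched_from_pretrained
```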
The results are as follows: hf-causal-experimental (pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0360 | ± | 0.0118 |
| xcopa_zh | 0 | acc | 0.6700 | ± | 0.0210 |
| xstory_cloze_zh | 0 | acc | 0.6320 | ± | 0.0124 |
| xwinograd_zh | 0 | acc | 0.7857 | ± | 0.0183 |
| xnli_zh | 0 | acc | 0.3818 | ± | 0.0069 |
hf-causal-experimental (pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True), limit: None, provide_description: False, num_fewshot: 5, batch_size: None
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0960 | ± | 0.0187 |
| xcopa_zh | 0 | acc | 0.7240 | ± | 0.0200 |
| xstory_cloze_zh | 0 | acc | 0.6565 | ± | 0.0122 |
| xwinograd_zh | 0 | acc | 0.8016 | ± | 0.0178 |
| xnli_zh | 0 | acc | 0.4341 | ± | 0.0070 |
For comparison, WizardLM-7B's Chinese performance: hf-causal-experimental (pretrained=/models/TheBloke_WizardLM-7B-uncensored-GPTQ/,quantized=WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors,gptq_use_triton=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0280 | ± | 0.0105 |
| xcopa_zh | 0 | acc | 0.5340 | ± | 0.0223 |
| xstory_cloze_zh | 0 | acc | 0.5162 | ± | 0.0129 |
| xwinograd_zh | 0 | acc | 0.5417 | ± | 0.0222 |
| xnli_zh | 0 | acc | 0.3439 | ± | 0.0067 |
hf-causal-experimental (pretrained=/models/TheBloke_WizardLM-7B-uncensored-GPTQ/,quantized=WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors,gptq_use_triton=True), limit: None, provide_description: False, num_fewshot: 5, batch_size: None
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0360 | ± | 0.0118 |
| xcopa_zh | 0 | acc | 0.5420 | ± | 0.0223 |
| xstory_cloze_zh | 0 | acc | 0.5242 | ± | 0.0129 |
| xwinograd_zh | 0 | acc | 0.6071 | ± | 0.0218 |
| xnli_zh | 0 | acc | 0.3599 | ± | 0.0068 |
The comparison shows that Chinese ability is indeed much improved over LLaMA-family derivatives. Hoping the Baichuan team keeps making it better and better!
Checklist
- [X] I have provided all relevant and necessary information above.
- [X] I have chosen a suitable title for this issue.
Thanks for sharing. Note, however, that WizardLM appears to be 4-bit quantized while baichuan is not quantized, so the impact of that difference on the results should be taken into account.