Baichuan-7B
lm-evaluation-harness Chinese task evaluation results, compared with WizardLM [Question]
Required prerequisites
- [X] I have read the documentation https://github.com/baichuan-inc/baichuan-7B/blob/HEAD/README.md.
- [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [X] Consider asking first in a Discussion.
Questions
Thanks to the Baichuan team for their contribution. To gauge baichuan-7B's Chinese ability, I picked the Chinese tasks from lm-evaluation-harness: xwinograd_zh, xnli_zh, xcopa_zh, xstory_cloze_zh, and mgsm_zh. Of these, xwinograd_zh, xnli_zh, xcopa_zh, and xstory_cloze_zh lean toward reasoning, while mgsm_zh leans toward math. I ran two evaluations, one with num_fewshot=0 and one with num_fewshot=5. One caveat: lm-evaluation-harness does not pass trust_remote_code to the tokenizer by default, so I had to apply a small hack to get it running (a sketch is included below); everything else was left unchanged.
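For reference, the workaround amounts to making sure trust_remote_code=True reaches the tokenizer loader inside the harness. The sketch below is a monkey-patch that has the same effect as my local edit, not necessarily the exact lines I changed; the reproduction command in the comment assumes the main.py CLI of the lm-evaluation-harness version I used.

```python
# Roughly how the harness was invoked (flags per the main.py CLI of the
# lm-evaluation-harness version used here):
#   python main.py --model hf-causal-experimental \
#       --model_args pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True \
#       --tasks xwinograd_zh,xnli_zh,xcopa_zh,xstory_cloze_zh,mgsm_zh \
#       --num_fewshot 0

# Minimal sketch of the tokenizer hack: force trust_remote_code=True on every
# AutoTokenizer.from_pretrained call, since baichuan-7B ships a custom tokenizer.
# This must run before the harness loads the tokenizer (e.g. at the top of main.py).
import transformers

_orig_from_pretrained = transformers.AutoTokenizer.from_pretrained

def _patched_from_pretrained(*args, **kwargs):
    kwargs.setdefault("trust_remote_code", True)
    return _orig_from_pretrained(*args, **kwargs)

transformers.AutoTokenizer.from_pretrained = _patched_from_pretrained
```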
The results are as follows: hf-causal-experimental (pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0360 | ± | 0.0118 |
| xcopa_zh | 0 | acc | 0.6700 | ± | 0.0210 |
| xstory_cloze_zh | 0 | acc | 0.6320 | ± | 0.0124 |
| xwinograd_zh | 0 | acc | 0.7857 | ± | 0.0183 |
| xnli_zh | 0 | acc | 0.3818 | ± | 0.0069 |
hf-causal-experimental (pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True), limit: None, provide_description: False, num_fewshot: 5, batch_size: None
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0960 | ± | 0.0187 |
| xcopa_zh | 0 | acc | 0.7240 | ± | 0.0200 |
| xstory_cloze_zh | 0 | acc | 0.6565 | ± | 0.0122 |
| xwinograd_zh | 0 | acc | 0.8016 | ± | 0.0178 |
| xnli_zh | 0 | acc | 0.4341 | ± | 0.0070 |
For comparison, WizardLM-7B's Chinese performance: hf-causal-experimental (pretrained=/models/TheBloke_WizardLM-7B-uncensored-GPTQ/,quantized=WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors,gptq_use_triton=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0280 | ± | 0.0105 |
| xcopa_zh | 0 | acc | 0.5340 | ± | 0.0223 |
| xstory_cloze_zh | 0 | acc | 0.5162 | ± | 0.0129 |
| xwinograd_zh | 0 | acc | 0.5417 | ± | 0.0222 |
| xnli_zh | 0 | acc | 0.3439 | ± | 0.0067 |
hf-causal-experimental (pretrained=/models/TheBloke_WizardLM-7B-uncensored-GPTQ/,quantized=WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors,gptq_use_triton=True), limit: None, provide_description: False, num_fewshot: 5, batch_size: None
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0360 | ± | 0.0118 |
| xcopa_zh | 0 | acc | 0.5420 | ± | 0.0223 |
| xstory_cloze_zh | 0 | acc | 0.5242 | ± | 0.0129 |
| xwinograd_zh | 0 | acc | 0.6071 | ± | 0.0218 |
| xnli_zh | 0 | acc | 0.3599 | ± | 0.0068 |
The comparison shows that Chinese ability is indeed much improved over LLaMA-family derivatives. Hoping the Baichuan team keeps making it better and better!
Checklist
- [X] I have provided all relevant and necessary information above.
- [X] I have chosen a suitable title for this issue.
Thanks for sharing. Note, however, that WizardLM appears to be 4-bit quantized while baichuan is not quantized, so the impact of that difference on the results should be taken into account.