[Track] DeepSeek V3/R1 accuracy

Open zhyncs opened this issue 10 months ago • 3 comments

conclusion

The gsm8k and mmlu results below are consistent with the official release.

server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.955
Invalid: 0.000
Latency: 109.212 s
Output throughput: 1244.611 token/s

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.750
subject: anatomy, #q:135, acc: 0.844
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.921
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.630
subject: college_computer_science, #q:100, acc: 0.860
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.884
subject: college_physics, #q:102, acc: 0.833
subject: computer_security, #q:100, acc: 0.880
subject: conceptual_physics, #q:235, acc: 0.928
subject: econometrics, #q:114, acc: 0.754
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.942
subject: formal_logic, #q:126, acc: 0.794
subject: global_facts, #q:100, acc: 0.670
subject: high_school_biology, #q:310, acc: 0.955
subject: high_school_chemistry, #q:203, acc: 0.847
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.965
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.921
subject: high_school_mathematics, #q:270, acc: 0.756
subject: high_school_microeconomics, #q:238, acc: 0.966
subject: high_school_physics, #q:151, acc: 0.828
subject: high_school_psychology, #q:545, acc: 0.971
subject: high_school_statistics, #q:216, acc: 0.856
subject: high_school_us_history, #q:204, acc: 0.956
subject: high_school_world_history, #q:237, acc: 0.945
subject: human_aging, #q:223, acc: 0.852
subject: human_sexuality, #q:131, acc: 0.939
subject: international_law, #q:121, acc: 0.959
subject: jurisprudence, #q:108, acc: 0.917
subject: logical_fallacies, #q:163, acc: 0.920
subject: machine_learning, #q:112, acc: 0.786
subject: management, #q:103, acc: 0.932
subject: marketing, #q:234, acc: 0.949
subject: medical_genetics, #q:100, acc: 0.940
subject: miscellaneous, #q:783, acc: 0.957
subject: moral_disputes, #q:346, acc: 0.887
subject: moral_scenarios, #q:895, acc: 0.773
subject: nutrition, #q:306, acc: 0.915
subject: philosophy, #q:311, acc: 0.897
subject: prehistory, #q:324, acc: 0.935
subject: professional_accounting, #q:282, acc: 0.865
subject: professional_law, #q:1534, acc: 0.702
subject: professional_medicine, #q:272, acc: 0.949
subject: professional_psychology, #q:612, acc: 0.913
subject: public_relations, #q:110, acc: 0.836
subject: security_studies, #q:245, acc: 0.890
subject: sociology, #q:201, acc: 0.960
subject: us_foreign_policy, #q:100, acc: 0.930
subject: virology, #q:166, acc: 0.584
subject: world_religions, #q:171, acc: 0.924
Total latency: 274.759
Average accuracy: 0.871
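
The log above does not say whether "Average accuracy" is weighted by the number of questions per subject or is a plain mean over subjects. A minimal sketch for checking this from a saved log, assuming the line format shown above (`subject: <name>, #q:<n>, acc: <value>`); the helper names `parse_mmlu_log` and `summarize` are hypothetical and not part of sglang:

```python
import re

# Matches lines such as "subject: anatomy, #q:135, acc: 0.844".
LINE_RE = re.compile(r"subject:\s*(\w+),\s*#q:\s*(\d+),\s*acc:\s*([0-9.]+)")


def parse_mmlu_log(text):
    """Return a list of (subject, num_questions, accuracy) tuples."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            rows.append((match.group(1), int(match.group(2)), float(match.group(3))))
    return rows


def summarize(rows):
    """Return (question-weighted mean, unweighted per-subject mean)."""
    total_questions = sum(n for _, n, _ in rows)
    weighted = sum(n * acc for _, n, acc in rows) / total_questions
    unweighted = sum(acc for _, _, acc in rows) / len(rows)
    return weighted, unweighted
```

Pasting the per-subject lines into a string and calling `summarize(parse_mmlu_log(text))` gives both averages to compare against the reported figure. Per-subject accuracies are printed to three decimals, so small rounding differences are expected.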

zhyncs avatar Feb 11 '25 10:02 zhyncs

Some accuracy results for DeepSeek-V3 on 8 * H20, cc: @zhyncs

Server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --mem-fraction-static 0.9

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.950
Invalid: 0.000
Latency: 236.747 s
Output throughput: 587.916 token/s

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.820
subject: anatomy, #q:135, acc: 0.881
subject: astronomy, #q:152, acc: 0.934
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.917
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.650
subject: college_computer_science, #q:100, acc: 0.830
subject: college_mathematics, #q:100, acc: 0.800
subject: college_medicine, #q:173, acc: 0.867
subject: college_physics, #q:102, acc: 0.814
subject: computer_security, #q:100, acc: 0.890
subject: conceptual_physics, #q:235, acc: 0.949
subject: econometrics, #q:114, acc: 0.807
subject: electrical_engineering, #q:145, acc: 0.876
subject: elementary_mathematics, #q:378, acc: 0.944
subject: formal_logic, #q:126, acc: 0.810
subject: global_facts, #q:100, acc: 0.730
subject: high_school_biology, #q:310, acc: 0.958
subject: high_school_chemistry, #q:203, acc: 0.897
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.885
subject: high_school_geography, #q:198, acc: 0.960
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.931
subject: high_school_mathematics, #q:270, acc: 0.752
subject: high_school_microeconomics, #q:238, acc: 0.954
subject: high_school_physics, #q:151, acc: 0.834
subject: high_school_psychology, #q:545, acc: 0.961
subject: high_school_statistics, #q:216, acc: 0.861
subject: high_school_us_history, #q:204, acc: 0.961
subject: high_school_world_history, #q:237, acc: 0.949
subject: human_aging, #q:223, acc: 0.870
subject: human_sexuality, #q:131, acc: 0.924
subject: international_law, #q:121, acc: 0.975
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.914
subject: machine_learning, #q:112, acc: 0.857
subject: management, #q:103, acc: 0.961
subject: marketing, #q:234, acc: 0.962
subject: medical_genetics, #q:100, acc: 0.960
subject: miscellaneous, #q:783, acc: 0.962
subject: moral_disputes, #q:346, acc: 0.864
subject: moral_scenarios, #q:895, acc: 0.806
subject: nutrition, #q:306, acc: 0.922
subject: philosophy, #q:311, acc: 0.929
subject: prehistory, #q:324, acc: 0.935
subject: professional_accounting, #q:282, acc: 0.869
subject: professional_law, #q:1534, acc: 0.720
subject: professional_medicine, #q:272, acc: 0.952
subject: professional_psychology, #q:612, acc: 0.907
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.869
subject: sociology, #q:201, acc: 0.945
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.578
subject: world_religions, #q:171, acc: 0.930
Total latency: 435.171
Average accuracy: 0.878

FlamingoPg avatar Feb 11 '25 13:02 FlamingoPg

It seems that DeepSeek-V3/R1 served with sglang does not reach the 88.5/90.8 MMLU accuracy claimed in the paper. How can the MMLU accuracy from the paper be reproduced?
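
To put numbers on the gap, here is a small sketch that compares the averages reported in this thread with the figures quoted above. It assumes that 88.5 refers to DeepSeek-V3 and 90.8 to DeepSeek-R1, following the order in the question, and it does not establish whether the paper used the same prompt format or shot count as `bench_sglang.py`:

```python
# Average MMLU accuracy reported in this thread (fraction of 1).
reported = {"DeepSeek-V3": 0.878, "DeepSeek-R1": 0.871}

# Paper figures as quoted in the comment above (percent).
# The model-to-number mapping is an assumption based on the order "V3/R1".
paper_pct = {"DeepSeek-V3": 88.5, "DeepSeek-R1": 90.8}


def gap_points(reported_acc, paper_percent):
    """Gap in percentage points: paper figure minus reported accuracy."""
    return round(paper_percent - reported_acc * 100, 1)


gaps = {name: gap_points(reported[name], paper_pct[name]) for name in reported}
for name, gap in gaps.items():
    print(f"{name}: {gap} points below the quoted paper figure")
```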

ictzyqq avatar Feb 17 '25 03:02 ictzyqq

Tested EP8 DeepSeek-V3 accuracy for this PR: https://github.com/sgl-project/sglang/pull/3602, cc: @zhyncs @sleepcoo

Device

8 * H200

Server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-dp-attention --enable-ep-moe

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.952
Invalid: 0.000
Latency: 154.554 s
Output throughput: 893.419 token/s
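
The gsm8k accuracies in this thread (0.955, 0.950, 0.952) differ by a few thousandths. A rough way to judge whether such differences matter is the binomial standard error for 1319 questions. This is a sketch under the assumption that each question is an independent pass/fail trial, which is a simplification; the `binomial_se` helper is hypothetical:

```python
import math


def binomial_se(accuracy, num_questions):
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_questions)


# gsm8k accuracies reported in this thread, all with --num-questions 1319.
runs = {
    "DeepSeek-R1, TP8": 0.955,
    "DeepSeek-V3, TP8 on 8 * H20": 0.950,
    "DeepSeek-V3, EP8 on 8 * H200": 0.952,
}

for name, acc in runs.items():
    print(f"{name}: {acc:.3f} +/- {binomial_se(acc, 1319):.4f}")
```

If the standard error is larger than the spread between runs, the runs cannot be distinguished on this benchmark alone.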

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.790
subject: anatomy, #q:135, acc: 0.881
subject: astronomy, #q:152, acc: 0.921
subject: business_ethics, #q:100, acc: 0.880
subject: clinical_knowledge, #q:265, acc: 0.921
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.660
subject: college_computer_science, #q:100, acc: 0.850
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.867
subject: college_physics, #q:102, acc: 0.843
subject: computer_security, #q:100, acc: 0.890
subject: conceptual_physics, #q:235, acc: 0.945
subject: econometrics, #q:114, acc: 0.789
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.944
subject: formal_logic, #q:126, acc: 0.825
subject: global_facts, #q:100, acc: 0.690
subject: high_school_biology, #q:310, acc: 0.958
subject: high_school_chemistry, #q:203, acc: 0.887
subject: high_school_computer_science, #q:100, acc: 0.930
subject: high_school_european_history, #q:165, acc: 0.885
subject: high_school_geography, #q:198, acc: 0.955
subject: high_school_government_and_politics, #q:193, acc: 0.984
subject: high_school_macroeconomics, #q:390, acc: 0.926
subject: high_school_mathematics, #q:270, acc: 0.759
subject: high_school_microeconomics, #q:238, acc: 0.958
subject: high_school_physics, #q:151, acc: 0.834
subject: high_school_psychology, #q:545, acc: 0.960
subject: high_school_statistics, #q:216, acc: 0.847
subject: high_school_us_history, #q:204, acc: 0.961
subject: high_school_world_history, #q:237, acc: 0.949
subject: human_aging, #q:223, acc: 0.861
subject: human_sexuality, #q:131, acc: 0.924
subject: international_law, #q:121, acc: 0.975
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.908
subject: machine_learning, #q:112, acc: 0.848
subject: management, #q:103, acc: 0.942
subject: marketing, #q:234, acc: 0.957
subject: medical_genetics, #q:100, acc: 0.950
subject: miscellaneous, #q:783, acc: 0.958
subject: moral_disputes, #q:346, acc: 0.873
subject: moral_scenarios, #q:895, acc: 0.800
subject: nutrition, #q:306, acc: 0.915
subject: philosophy, #q:311, acc: 0.913
subject: prehistory, #q:324, acc: 0.932
subject: professional_accounting, #q:282, acc: 0.876
subject: professional_law, #q:1534, acc: 0.716
subject: professional_medicine, #q:272, acc: 0.949
subject: professional_psychology, #q:612, acc: 0.908
subject: public_relations, #q:110, acc: 0.800
subject: security_studies, #q:245, acc: 0.882
subject: sociology, #q:201, acc: 0.950
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.578
subject: world_religions, #q:171, acc: 0.930
Total latency: 1153.812
Average accuracy: 0.876

FlamingoPg avatar Feb 19 '25 14:02 FlamingoPg

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Apr 21 '25 00:04 github-actions[bot]