
Use n_gen from context, fix tps mismatch in gemma3

Open hebangwen opened this issue 2 months ago • 1 comments

This PR addresses #3947. When benchmarking gemma3, the model emits the EOS token very quickly, i.e. after 3~4 tokens, but the decoding tps is still calculated as if all 128 requested tokens had been generated. As a result, the benchmark reports a decoding speed that is far too high for this model.
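The mismatch can be illustrated with a minimal sketch. This is not the MNN benchmark code: the function name, the token counts, and the elapsed time below are hypothetical values chosen only to show why dividing the requested token count by the measured time inflates the result when generation stops early.

```python
def decode_tps(n_tokens: int, decode_seconds: float) -> float:
    """Return decoding throughput in tokens per second.

    Hypothetical helper for illustration; not part of MNN.
    """
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return n_tokens / decode_seconds


# Assumed, illustrative numbers (not taken from the benchmark above).
requested_tokens = 128   # decode length requested by the benchmark
generated_tokens = 4     # tokens actually produced before EOS
elapsed_seconds = 0.25   # time spent decoding those 4 tokens

# Before the fix: the requested count is divided by the real elapsed time.
inflated_tps = decode_tps(requested_tokens, elapsed_seconds)

# After the fix: the number of tokens actually generated is used.
corrected_tps = decode_tps(generated_tokens, elapsed_seconds)

print(f"inflated:  {inflated_tps:.2f} tok/s")
print(f"corrected: {corrected_tps:.2f} tok/s")
```

With these assumed numbers the inflated figure is 32 times the corrected one, which is exactly the ratio of requested to generated tokens.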

before fix:

| model | modelSize | backend | threads | precision | llm_demo | speed(tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=128<br>decode=128 | 45.88 ± 0.54<br>316.10 ± 2.58 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=256<br>decode=128 | 45.63 ± 0.53<br>311.16 ± 1.67 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=512<br>decode=128 | 45.00 ± 0.34<br>11.91 ± 0.14 |

after fix:

| model | modelSize | backend | threads | precision | llm_demo | speed(tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=128<br>decode=128 | 44.90 ± 0.46<br>12.28 ± 0.22 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=256<br>decode=128 | 45.41 ± 0.14<br>12.22 ± 0.04 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=512<br>decode=128 | 45.00 ± 0.08<br>12.04 ± 0.02 |

hebangwen, Oct 17 '25 06:10

Please resolve the conflicts and we can merge it.

jxt1234, Nov 03 '25 08:11