MNN
Use n_gen from context, fix tps mismatch in gemma3
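This PR addresses #3947. When benchmarking gemma3, the model emits the EOS token very quickly, after only 3~4 tokens, but the decode tps is still computed as if all 128 requested tokens had been generated. As a result, the benchmark reports a misleadingly high decode speed. The fix takes the actual generated token count (n_gen) from the context instead.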
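To make the mismatch concrete, here is a minimal sketch of the arithmetic. It is not the MNN benchmark code: the function name `decode_tps` and all the numbers below are hypothetical and chosen only for illustration, assuming a run that requests 128 decode tokens but stops at EOS after 4.

```python
def decode_tps(tokens_counted: int, decode_seconds: float) -> float:
    """Return decode throughput in tokens per second.

    Hypothetical helper for illustration; not part of MNN.
    """
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return tokens_counted / decode_seconds


# Illustrative values only (not measurements from this PR).
requested_tokens = 128   # decode length passed to the benchmark
generated_tokens = 4     # tokens actually produced before EOS
decode_seconds = 0.25    # wall-clock time spent decoding

# Dividing the requested count by the elapsed time inflates the result.
inflated_tps = decode_tps(requested_tokens, decode_seconds)

# Dividing the actually generated count gives the real throughput.
actual_tps = decode_tps(generated_tokens, decode_seconds)

# The inflation factor equals requested_tokens / generated_tokens.
inflation = inflated_tps / actual_tps
```

With these assumed values the inflated figure is 32 times the real one, which is the same kind of gap visible between the decode speeds reported before and after the fix.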
Before fix:

| model | modelSize | backend | threads | precision | llm_demo | speed(tok/s) |
|---|---|---|---|---|---|---|
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=128 decode=128 | 45.88 ± 0.54 316.10 ± 2.58 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=256 decode=128 | 45.63 ± 0.53 311.16 ± 1.67 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=512 decode=128 | 45.00 ± 0.34 11.91 ± 0.14 |

After fix:

| model | modelSize | backend | threads | precision | llm_demo | speed(tok/s) |
|---|---|---|---|---|---|---|
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=128 decode=128 | 44.90 ± 0.46 12.28 ± 0.22 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=256 decode=128 | 45.41 ± 0.14 12.22 ± 0.04 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=512 decode=128 | 45.00 ± 0.08 12.04 ± 0.02 |

Please resolve the conflicts and we can merge it.