Refine use_cache for MPT model
What does this PR do?
This PR modifies the KV-cache initialization for the MPT model and improves generation performance; a sketch of the approach follows the results table below.

Co-author: @atakaha

Test command:
```bash
python run_generation.py \
    --model_name_or_path mosaicml/mpt-7b \
    --use_hpu_graphs \
    --use_kv_cache \
    --limit_hpu_graph \
    --batch_size 128 \
    --max_input_tokens 128 \
    --max_new_tokens 128 \
    --trim_logits \
    --attn_softmax_bf16 \
    --warmup 3 \
    --n_iterations 1 \
    --bf16
```
Results:
| Version | Batch size | Max input tokens | Max new tokens | Throughput incl. tokenization (tokens/s) | Memory allocated (GB) | Max memory allocated (GB) |
|---|---|---|---|---|---|---|
| before | 128 | 128 | 128 | 2900 | 37.28 | 64.39 |
| after | 128 | 128 | 128 | 4803 | 29.28 | 48.39 |
| before | 16 | 128 | 1024 | 624 | 26.55 | 41.79 |
| after | 16 | 128 | 1024 | 1215 | 23.04 | 33.77 |
| before | 2 | 1024 | 1024 | 197 | 16.27 | 19.66 |
| after | 2 | 1024 | 1024 | 255 | 15.27 | 17.66 |
| before | 16 | 1024 | 1024 | 384 | 40.78 | 67.86 |
| after | 16 | 1024 | 1024 | 846 | 32.78 | 51.86 |
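For context, here is a minimal sketch of the static KV-cache pattern this kind of refinement typically follows on HPU: the cache is pre-allocated once at its maximum sequence length and updated in place at each decode step, rather than grown by concatenation. The function names (`allocate_kv_cache`, `update_kv_cache`) and signatures below are illustrative assumptions, not the actual code in this PR.

```python
import torch

def allocate_kv_cache(batch_size, num_heads, head_dim, max_seq_len,
                      dtype=torch.bfloat16, device="hpu"):
    """Illustrative sketch (not the PR's actual code): pre-allocate
    full-length key/value buffers once, so every decode step writes
    in place instead of concatenating and reallocating tensors."""
    shape = (batch_size, num_heads, max_seq_len, head_dim)
    key_cache = torch.zeros(shape, dtype=dtype, device=device)
    value_cache = torch.zeros(shape, dtype=dtype, device=device)
    return key_cache, value_cache

def update_kv_cache(key_cache, value_cache, key, value, token_idx):
    """Write the current step's key/value states into the pre-allocated
    buffers at position `token_idx` along the sequence dimension.
    `token_idx` is a 1-element LongTensor; `key`/`value` have shape
    (batch, heads, 1, head_dim)."""
    key_cache.index_copy_(2, token_idx, key)
    value_cache.index_copy_(2, token_idx, value)
    return key_cache, value_cache
```

Keeping the cache shape static across decode steps avoids repeated reallocation and lets HPU graphs be replayed without shape changes, which is consistent with the lower memory footprint and higher throughput shown in the table above.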
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you make sure to update the documentation with your changes?
- [ ] Did you write any new necessary tests?