refine bucket_internal for mpt
What does this PR do?
The existing `bucket_internal` support for the MPT model only implements the processing of the first token. This PR adds the corresponding processing for subsequent tokens. Throughput improves, but the changing shapes of `key_states` and `value_states` increase memory usage.
I am exploring whether dynamic shapes can solve this problem. If anyone has any clues, please let me know. Thanks!
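For context, here is a minimal sketch of the bucketing idea, not the actual optimum-habana implementation: `next_bucket`, `decode_step`, and all shapes are made up for illustration. The key/value cache is pre-allocated, and at each decode step only the slice up to the current bucket boundary is attended over, so the `key_states`/`value_states` shapes change once per `bucket_size` tokens rather than on every token:

```python
import torch

BUCKET_SIZE = 32

def next_bucket(length: int) -> int:
    # Round length up to the next multiple of BUCKET_SIZE.
    return -(-length // BUCKET_SIZE) * BUCKET_SIZE

# Hypothetical cache shapes: [batch, num_heads, max_seq_len, head_dim].
batch, heads, max_len, head_dim = 2, 4, 256, 64
key_cache = torch.zeros(batch, heads, max_len, head_dim)
value_cache = torch.zeros(batch, heads, max_len, head_dim)

def decode_step(pos: int, new_k: torch.Tensor, new_v: torch.Tensor):
    """Write the new token's K/V at position `pos`, then return cache views
    sliced to the current bucket boundary. The sliced length changes only
    once every BUCKET_SIZE tokens, so a graph compiled for one bucket can
    be replayed for every decode step inside that bucket."""
    key_cache[:, :, pos] = new_k
    value_cache[:, :, pos] = new_v
    active_len = next_bucket(pos + 1)
    return key_cache[:, :, :active_len], value_cache[:, :, :active_len]

# Example: at position 40 the attended cache length is 64 (two buckets).
k, v = decode_step(40, torch.randn(batch, heads, head_dim),
                   torch.randn(batch, heads, head_dim))
print(k.shape)  # torch.Size([2, 4, 64, 64])
```

Since each new bucket length implies a differently shaped graph, this may be related to the memory increase reported below, though that is only a hypothesis.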
Test command:

```bash
python run_generation.py --model_name_or_path mosaicml/mpt-7b --use_hpu_graphs --use_kv_cache --limit_hpu_graph --batch_size 128 --max_input_tokens 128 --max_new_tokens 128 --trim_logits --attn_softmax_bf16 --warmup 3 --n_iterations 1 --bf16 --bucket_internal --bucket_size 32
```
Test results:
| | bs/max_input/max_output | Throughput (including tokenization) (tokens/s) | Max memory (GB) |
|---|---|---|---|
| before | 128/128/128 | 5996 | 37.04 |
| after | 128/128/128 | 6443 | 45.07 |
| before | 16/128/1024 | 1322 | 26.4 |
| after | 16/128/1024 | 1673 | 30.93 |
| before | 32/128/512 | 2527 | 28.02 |
| after | 32/128/512 | 2997 | 33.08 |
| before | 64/128/256 | 4263 | 31.28 |
| after | 64/128/256 | 4816 | 36.94 |
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you make sure to update the documentation with your changes?
- [ ] Did you write any new necessary tests?