
refine bucket_internal for mpt

Jing1Ling opened this issue 7 months ago · 4 comments

What does this PR do?

The existing bucket_internal implementation for the MPT model only handles the first token. This PR adds handling for subsequent tokens. Although throughput has improved, the changing shapes of key_states and value_states have increased memory usage.

I am exploring whether dynamic shapes can solve this problem. If anyone has any clues, please let me know. Thanks!
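For context, the core idea behind `--bucket_internal --bucket_size 32` is to round sequence lengths up to the next bucket boundary, so HPU graphs only need to be compiled for a small, fixed set of shapes instead of one per decode step. A minimal sketch of that rounding (the helper name `round_up_to_bucket` is illustrative, not from the PR):

```python
import math

def round_up_to_bucket(length: int, bucket_size: int) -> int:
    """Round a sequence length up to the next bucket boundary so that
    graphs are reused across all decode steps inside the same bucket."""
    return math.ceil(length / bucket_size) * bucket_size

# With bucket_size=32, decode steps at lengths 129..160 all share the
# same padded KV-cache shape, and a new shape appears only at 161.
assert round_up_to_bucket(129, 32) == 160
assert round_up_to_bucket(160, 32) == 160
assert round_up_to_bucket(161, 32) == 192
```

Since the bucketed key_states/value_states grow in 32-token steps rather than staying at one maximum shape, each new bucket boundary materializes a new tensor shape, which is consistent with the memory growth reported below.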

```shell
python run_generation.py --model_name_or_path mosaicml/mpt-7b --use_hpu_graphs --use_kv_cache --limit_hpu_graph --batch_size 128 --max_input_tokens 128 --max_new_tokens 128 --trim_logits --attn_softmax_bf16 --warmup 3 --n_iterations 1 --bf16 --bucket_internal --bucket_size 32
```

Test results:

| | bs/max_input/max_output | Throughput incl. tokenization (tokens/s) | Max memory (GB) |
|---|---|---|---|
| before | 128/128/128 | 5996 | 37.04 |
| after | 128/128/128 | 6443 | 45.07 |
| before | 16/128/1024 | 1322 | 26.4 |
| after | 16/128/1024 | 1673 | 30.93 |
| before | 32/128/512 | 2527 | 28.02 |
| after | 32/128/512 | 2997 | 33.08 |
| before | 64/128/256 | 4263 | 31.28 |
| after | 64/128/256 | 4816 | 36.94 |
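To put the table in relative terms, the before/after rows can be turned into percentage changes with a short script (the numbers below are taken directly from the table; the script itself is just a convenience, not part of the PR):

```python
# (config, throughput before/after in tokens/s, max memory before/after in GB)
rows = [
    ("128/128/128", 5996, 6443, 37.04, 45.07),
    ("16/128/1024", 1322, 1673, 26.40, 30.93),
    ("32/128/512", 2527, 2997, 28.02, 33.08),
    ("64/128/256", 4263, 4816, 31.28, 36.94),
]

for cfg, t_before, t_after, m_before, m_after in rows:
    t_gain = (t_after - t_before) / t_before
    m_gain = (m_after - m_before) / m_before
    print(f"{cfg}: throughput +{t_gain:.1%}, max memory +{m_gain:.1%}")
```

The throughput gain ranges from roughly 7% (large batch, short output) to roughly 27% (small batch, long output), at the cost of a 16–22% increase in peak memory.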

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you make sure to update the documentation with your changes?
  • [ ] Did you write any new necessary tests?

Jing1Ling · Aug 02 '24 13:08