
refine use cache for mpt model

Open · Jing1Ling opened this pull request 7 months ago · 10 comments

What does this PR do?

This PR modifies the KV cache initialization for the MPT model and improves generation performance. Co-author: @atakaha

Test command:

python run_generation.py --model_name_or_path mosaicml/mpt-7b --use_hpu_graphs --use_kv_cache --limit_hpu_graph --batch_size 128  --max_input_tokens 128 --max_new_tokens 128 --trim_logits --attn_softmax_bf16 --warmup 3 --n_iterations 1 --bf16
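The gains reported below are consistent with switching from a dynamically growing KV cache (re-concatenated every decode step) to a statically pre-allocated one that is updated in place, which keeps tensor shapes constant so HPU graphs can be replayed without recompilation. The snippet below is a minimal illustrative sketch of that pattern, not the PR's actual code; the helper names (`init_static_kv_cache`, `update_kv_cache`) are hypothetical.

```python
import torch

def init_static_kv_cache(batch, n_heads, max_len, head_dim, dtype=torch.float32):
    # Pre-allocate fixed-shape key/value buffers covering the full
    # max sequence length up front, instead of growing them per step.
    shape = (batch, n_heads, max_len, head_dim)
    return torch.zeros(shape, dtype=dtype), torch.zeros(shape, dtype=dtype)

def update_kv_cache(cache, new_states, token_idx):
    # Write the new key/value states in place at position token_idx
    # along the sequence dimension (dim 2). Unlike torch.cat, this
    # keeps the cache shape constant across decode steps.
    cache.index_copy_(2, token_idx, new_states)
    return cache

# Example decode step: write one new token's key states at position 5.
k_cache, v_cache = init_static_kv_cache(batch=2, n_heads=4, max_len=16, head_dim=8)
new_k = torch.randn(2, 4, 1, 8)
token_idx = torch.tensor([5])
update_kv_cache(k_cache, new_k, token_idx)

# The cache shape never changes, so a captured graph can be reused.
assert k_cache.shape == (2, 4, 16, 8)
```

Because the buffers are allocated once at their maximum size, peak memory becomes predictable, which matches the lower "Max memory allocated" numbers in the table below.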

Result:

| Version | Batch size | Max input tokens | Max new tokens | Throughput incl. tokenization (tokens/s) | Memory allocated (GB) | Max memory allocated (GB) |
|---------|-----------:|-----------------:|---------------:|-----------------------------------------:|----------------------:|--------------------------:|
| before  | 128 | 128  | 128  | 2900 | 37.28 | 64.39 |
| after   | 128 | 128  | 128  | 4803 | 29.28 | 48.39 |
| before  | 16  | 128  | 1024 | 624  | 26.55 | 41.79 |
| after   | 16  | 128  | 1024 | 1215 | 23.04 | 33.77 |
| before  | 2   | 1024 | 1024 | 197  | 16.27 | 19.66 |
| after   | 2   | 1024 | 1024 | 255  | 15.27 | 17.66 |
| before  | 16  | 1024 | 1024 | 384  | 40.78 | 67.86 |
| after   | 16  | 1024 | 1024 | 846  | 32.78 | 51.86 |

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you make sure to update the documentation with your changes?
  • [ ] Did you write any new necessary tests?

Jing1Ling · Jul 25 '24 07:07