Refine use_cache for MPT model
What does this PR do?
This PR modifies the KV-cache initialization for the MPT model and improves generation performance; a sketch of the approach follows the results table below.

Co-author: @atakaha

Test command:
```bash
python run_generation.py \
    --model_name_or_path mosaicml/mpt-7b \
    --use_hpu_graphs \
    --use_kv_cache \
    --limit_hpu_graph \
    --batch_size 128 \
    --max_input_tokens 128 \
    --max_new_tokens 128 \
    --trim_logits \
    --attn_softmax_bf16 \
    --warmup 3 \
    --n_iterations 1 \
    --bf16
```
Results:
| Version | Batch size | Max input tokens | Max new tokens | Throughput incl. tokenization (tokens/s) | Memory allocated (GB) | Max memory allocated (GB) |
|---|---|---|---|---|---|---|
| before | 128 | 128 | 128 | 2900 | 37.28 | 64.39 |
| after | 128 | 128 | 128 | 4803 | 29.28 | 48.39 |
| before | 16 | 128 | 1024 | 624 | 26.55 | 41.79 |
| after | 16 | 128 | 1024 | 1215 | 23.04 | 33.77 |
| before | 2 | 1024 | 1024 | 197 | 16.27 | 19.66 |
| after | 2 | 1024 | 1024 | 255 | 15.27 | 17.66 |
| before | 16 | 1024 | 1024 | 384 | 40.78 | 67.86 |
| after | 16 | 1024 | 1024 | 846 | 32.78 | 51.86 |
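For context, here is a minimal sketch of the static KV-cache pattern this kind of refinement typically follows on HPU: the cache is pre-allocated once at its maximum sequence length and updated in place at each decode step, rather than grown by concatenation. The function names (`allocate_kv_cache`, `update_kv_cache`) and signatures below are illustrative assumptions, not the actual code in this PR.

```python
import torch

def allocate_kv_cache(batch_size, num_heads, head_dim, max_seq_len,
                      dtype=torch.bfloat16, device="hpu"):
    """Illustrative sketch (not the PR's actual code): pre-allocate
    full-length key/value buffers once, so every decode step writes
    in place instead of concatenating and reallocating tensors."""
    shape = (batch_size, num_heads, max_seq_len, head_dim)
    key_cache = torch.zeros(shape, dtype=dtype, device=device)
    value_cache = torch.zeros(shape, dtype=dtype, device=device)
    return key_cache, value_cache

def update_kv_cache(key_cache, value_cache, key, value, token_idx):
    """Write the current step's key/value states into the pre-allocated
    buffers at position `token_idx` along the sequence dimension.
    `token_idx` is a 1-element LongTensor; `key`/`value` have shape
    (batch, heads, 1, head_dim)."""
    key_cache.index_copy_(2, token_idx, key)
    value_cache.index_copy_(2, token_idx, value)
    return key_cache, value_cache
```

Keeping the cache shape static across decode steps avoids repeated reallocation and lets HPU graphs be replayed without shape changes, which is consistent with the lower memory footprint and higher throughput shown in the table above.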
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you make sure to update the documentation with your changes?
- [ ] Did you write any new necessary tests?