refine bucket_internal for mpt
What does this PR do?
The existing `bucket_internal` support for the MPT model only implements the processing of the first token. This PR adds the corresponding processing for subsequent tokens. Throughput improves, but the changing shapes of `key_states` and `value_states` increase memory usage.
I am exploring whether dynamic shapes can solve this problem. If anyone has any clues, please let me know. Thanks!
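For context, here is a minimal sketch of the bucketing idea, not the actual optimum-habana implementation: `next_bucket`, `decode_step`, and all shapes are made up for illustration. The key/value cache is pre-allocated, and at each decode step only the slice up to the current bucket boundary is attended over, so the `key_states`/`value_states` shapes change once per `bucket_size` tokens rather than on every token:

```python
import torch

BUCKET_SIZE = 32

def next_bucket(length: int) -> int:
    # Round length up to the next multiple of BUCKET_SIZE.
    return -(-length // BUCKET_SIZE) * BUCKET_SIZE

# Hypothetical cache shapes: [batch, num_heads, max_seq_len, head_dim].
batch, heads, max_len, head_dim = 2, 4, 256, 64
key_cache = torch.zeros(batch, heads, max_len, head_dim)
value_cache = torch.zeros(batch, heads, max_len, head_dim)

def decode_step(pos: int, new_k: torch.Tensor, new_v: torch.Tensor):
    """Write the new token's K/V at position `pos`, then return cache views
    sliced to the current bucket boundary. The sliced length changes only
    once every BUCKET_SIZE tokens, so a graph compiled for one bucket can
    be replayed for every decode step inside that bucket."""
    key_cache[:, :, pos] = new_k
    value_cache[:, :, pos] = new_v
    active_len = next_bucket(pos + 1)
    return key_cache[:, :, :active_len], value_cache[:, :, :active_len]

# Example: at position 40 the attended cache length is 64 (two buckets).
k, v = decode_step(40, torch.randn(batch, heads, head_dim),
                   torch.randn(batch, heads, head_dim))
print(k.shape)  # torch.Size([2, 4, 64, 64])
```

Since each new bucket length implies a differently shaped graph, this may be related to the memory increase reported below, though that is only a hypothesis.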
Test command:

```bash
python run_generation.py --model_name_or_path mosaicml/mpt-7b --use_hpu_graphs --use_kv_cache --limit_hpu_graph --batch_size 128 --max_input_tokens 128 --max_new_tokens 128 --trim_logits --attn_softmax_bf16 --warmup 3 --n_iterations 1 --bf16 --bucket_internal --bucket_size 32
```
Test results:
| | bs/max_input/max_output | Throughput (including tokenization) (tokens/s) | Max memory (GB) |
|---|---|---|---|
| before | 128/128/128 | 5996 | 37.04 |
| after | 128/128/128 | 6443 | 45.07 |
| before | 16/128/1024 | 1322 | 26.4 |
| after | 16/128/1024 | 1673 | 30.93 |
| before | 32/128/512 | 2527 | 28.02 |
| after | 32/128/512 | 2997 | 33.08 |
| before | 64/128/256 | 4263 | 31.28 |
| after | 64/128/256 | 4816 | 36.94 |
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you make sure to update the documentation with your changes?
- [ ] Did you write any new necessary tests?