optimum-habana
Add FP8-related changes to Mistral for text generation
What does this PR do?
Initial FP8 support for Mistral text generation.
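The runs below assume that per-tensor max-abs statistics have already been collected; a calibration pass along the following lines is expected to have been run first. This is a hedged sketch based on the existing text-generation FP8 examples in the repo: the maxabs_measure.json name and the exact flag set are assumptions, not part of this PR's diff.

```bash
# Sketch of the measurement/calibration step that produces the statistics
# later consumed by maxabs_quant.json; flag choices here are illustrative.
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
  --use_hpu_graphs --use_kv_cache --reuse_cache --bf16 \
  --batch_size 1 --max_input_tokens 128 --max_new_tokens 128
```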
Command lines and results (an annotated flag summary follows the list):
- 128x128
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --fp8 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs
Throughput (including tokenization) = 13250.825658116784 tokens/second
Number of HPU graphs = 85
Memory allocated = 38.37 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.98284676099138 seconds
- 2048x128
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs
Throughput (including tokenization) = 1362.8371789032228 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.82 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.72206230499432 seconds
- 2048x2048
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --fp8 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs
Throughput (including tokenization) = 3105.9817365063354 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.62 GB
Total memory available = 94.62 GB
Graph compilation duration = 414.38635561900446 seconds
- 128x2048
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs
Throughput (including tokenization) = 7738.114888711109 tokens/second
Number of HPU graphs = 565
Memory allocated = 74.97 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 405.53613558399957 seconds
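For reference, the four runs above all follow the same pattern. The template below is my reading of the run_generation.py flags used here; the flag descriptions are paraphrased rather than quoted from the script's help, and the placeholder values are illustrative, not an additional measured configuration.

```bash
# Common shape of the FP8 runs above; the three variables below are the
# per-scenario knobs (batch size, input length, output length).
#   --attn_softmax_bf16        keep the attention softmax in bf16
#   --use_hpu_graphs           capture HPU graphs and replay them
#   --limit_hpu_graphs         reduce HPU-graph memory usage
#   --trim_logits              keep only the logits needed for generation
#   --use_kv_cache --reuse_cache   static, reused KV cache
#   --bf16 --fp8               bf16 execution with FP8-quantized compute
#   --bucket_internal --bucket_size 128   bucket the KV cache (added in the 2048-new-token runs)
BATCH_SIZE=128; INPUT_LEN=128; OUTPUT_LEN=128   # placeholders

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
  --attn_softmax_bf16 --use_hpu_graphs --limit_hpu_graphs --trim_logits \
  --use_kv_cache --reuse_cache --bf16 --fp8 \
  --batch_size $BATCH_SIZE --max_input_tokens $INPUT_LEN --max_new_tokens $OUTPUT_LEN
```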