
add fp8 related changes to mistral for text-generation

Open · skaulintel opened this pull request 10 months ago · 1 comment

What does this PR do?

Initial mistral fp8 change

Command Lines:

  1. 128x128xbs4

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --fp8 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs

Throughput (including tokenization) = 13250.825658116784 tokens/second
Number of HPU graphs = 85
Memory allocated = 38.37 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.98284676099138 seconds
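The QUANT_CONFIG above points at a maxabs-style quantization recipe. As a rough illustration only (hypothetical helper names; the real implementation lives in Habana's quantization toolkit, not in this PR), maxabs scaling maps each tensor's largest absolute value onto the FP8 dynamic range:

```python
import numpy as np

# Illustrative maxabs scaling in the spirit of maxabs_quant.json.
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def maxabs_scale(t: np.ndarray) -> float:
    # Scale so the tensor's largest absolute value maps onto the FP8 range.
    return float(np.abs(t).max()) / FP8_E4M3_MAX

def fake_quant(t: np.ndarray) -> np.ndarray:
    # Quantize-dequantize round trip; FP8 rounding is approximated here by
    # clipping only, which is enough to show the effect of the scale.
    s = maxabs_scale(t)
    return np.clip(t / s, -FP8_E4M3_MAX, FP8_E4M3_MAX) * s

x = np.array([0.5, -3.0, 4.48])
print(np.allclose(fake_quant(x), x))  # True: in-range values survive the trip
```

Values within the calibrated range round-trip losslessly in this simplified model; the real toolkit additionally rounds to the FP8 grid and calibrates scales from measured statistics.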

  2. 2048x128

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs

Throughput (including tokenization) = 1362.8371789032228 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.82 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.72206230499432 seconds
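The reported rates can be sanity-checked with simple arithmetic (hedged: run_generation.py's accounting also includes tokenization, so the real timing differs slightly):

```python
# Rough sanity check of a reported throughput figure.
def implied_generation_seconds(batch_size: int, max_new_tokens: int,
                               tokens_per_second: float) -> float:
    # Total new tokens across the batch divided by the reported rate.
    return batch_size * max_new_tokens / tokens_per_second

# 2048x128 run above: batch 120, 128 new tokens, ~1362.84 tokens/second
print(round(implied_generation_seconds(120, 128, 1362.8371789032228), 2))  # 11.27
```

So the 2048x128 run generates its 15,360 new tokens in roughly 11 seconds of wall-clock time per pass.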

  3. 2048x2048

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --fp8 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 3105.9817365063354 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.62 GB
Total memory available = 94.62 GB
Graph compilation duration = 414.38635561900446 seconds
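The long-output runs add --bucket_internal --bucket_size 128. A minimal sketch of what bucketing implies (simplified; the actual logic lives in optimum-habana's generation utilities): sequence lengths are rounded up to bucket boundaries so only a bounded set of HPU graph shapes needs compiling, instead of one graph per length.

```python
# Round a sequence length up to the next bucket boundary.
def bucket_len(seq_len: int, bucket_size: int = 128) -> int:
    # Ceiling division to the next multiple of bucket_size.
    return -(-seq_len // bucket_size) * bucket_size

print(bucket_len(2049))  # -> 2176
print(bucket_len(128))   # -> 128
```

This trades some padded compute for far fewer graph compilations, which is why the bucketed 2048x2048 run still shows a large (but bounded) graph count and compilation time.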

  4. 128x2048

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 7738.114888711109 tokens/second
Number of HPU graphs = 565
Memory allocated = 74.97 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 405.53613558399957 seconds

skaulintel · Apr 23 '24 17:04

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.