optimum-habana
Add FP8-related changes to Mistral for text generation
What does this PR do?
Initial FP8 support for Mistral text generation.
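The runs below assume that per-tensor max-abs statistics have already been collected; a calibration pass along the following lines is expected to have been run first. This is a hedged sketch based on the existing text-generation FP8 examples in the repo: the maxabs_measure.json name and the exact flag set are assumptions, not part of this PR's diff.

```bash
# Sketch of the measurement/calibration step that produces the statistics
# later consumed by maxabs_quant.json; flag choices here are illustrative.
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
  --use_hpu_graphs --use_kv_cache --reuse_cache --bf16 \
  --batch_size 1 --max_input_tokens 128 --max_new_tokens 128
```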
Command lines and results (an annotated flag summary follows the list):
- 128x128
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --fp8 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs
Throughput (including tokenization) = 13250.825658116784 tokens/second
Number of HPU graphs = 85
Memory allocated = 38.37 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.98284676099138 seconds
- 2048x128
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs
Throughput (including tokenization) = 1362.8371789032228 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.82 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.72206230499432 seconds
- 2048x2048
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --fp8 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs
Throughput (including tokenization) = 3105.9817365063354 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.62 GB
Total memory available = 94.62 GB
Graph compilation duration = 414.38635561900446 seconds
- 128x2048
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs
Throughput (including tokenization) = 7738.114888711109 tokens/second
Number of HPU graphs = 565
Memory allocated = 74.97 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 405.53613558399957 seconds
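For reference, the four runs above all follow the same pattern. The template below is my reading of the run_generation.py flags used here; the flag descriptions are paraphrased rather than quoted from the script's help, and the placeholder values are illustrative, not an additional measured configuration.

```bash
# Common shape of the FP8 runs above; the three variables below are the
# per-scenario knobs (batch size, input length, output length).
#   --attn_softmax_bf16        keep the attention softmax in bf16
#   --use_hpu_graphs           capture HPU graphs and replay them
#   --limit_hpu_graphs         reduce HPU-graph memory usage
#   --trim_logits              keep only the logits needed for generation
#   --use_kv_cache --reuse_cache   static, reused KV cache
#   --bf16 --fp8               bf16 execution with FP8-quantized compute
#   --bucket_internal --bucket_size 128   bucket the KV cache (added in the 2048-new-token runs)
BATCH_SIZE=128; INPUT_LEN=128; OUTPUT_LEN=128   # placeholders

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
  --attn_softmax_bf16 --use_hpu_graphs --limit_hpu_graphs --trim_logits \
  --use_kv_cache --reuse_cache --bf16 --fp8 \
  --batch_size $BATCH_SIZE --max_input_tokens $INPUT_LEN --max_new_tokens $OUTPUT_LEN
```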