What does this PR do?

Support long sequences (32k tokens) with batch size 4 by moving the query (Q) slicing inside the loop to save memory.
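Query-block attention can be illustrated with a minimal NumPy sketch in which each block of Q is sliced inside the attention loop. This is not the optimum-habana Mixtral code. The names (`chunked_attention`, `q_block_size`) are hypothetical, the sketch handles a single head with no batch dimension, and it assumes a causal prefill where query i and key i share position i.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_block(q_blk, k, v, q_offset, causal):
    # Scores for one block of queries against all keys: shape (block, n_keys).
    scores = q_blk @ k.T / np.sqrt(q_blk.shape[-1])
    if causal:
        # Assumes prefill: query row i sits at position q_offset + i,
        # keys sit at positions 0 .. n_keys - 1.
        q_pos = np.arange(q_offset, q_offset + q_blk.shape[0])[:, None]
        k_pos = np.arange(k.shape[0])[None, :]
        scores = np.where(k_pos > q_pos, -np.inf, scores)
    return softmax(scores) @ v

def full_attention(q, k, v, causal=True):
    # Reference path: materializes the whole (n_q, n_keys) score matrix at once.
    return attention_block(q, k, v, 0, causal)

def chunked_attention(q, k, v, q_block_size, causal=True):
    # Peak score-matrix size is (q_block_size, n_keys) instead of (n_q, n_keys).
    n_q = q.shape[0]
    out = np.empty((n_q, v.shape[-1]), dtype=np.result_type(q, v))
    for start in range(0, n_q, q_block_size):
        end = min(start + q_block_size, n_q)
        # The Q slice is taken here, inside the loop, rather than splitting
        # all of Q into a list of blocks before the loop starts.
        out[start:end] = attention_block(q[start:end], k, v, start, causal)
    return out
```

In NumPy, `q[start:end]` is a view, so this sketch cannot reproduce any HPU-specific cost of splitting Q up front (for example, if every slice ends up materialized in a graph). What it does show is that each iteration only needs a `(q_block_size, n_keys)` score matrix, whereas the full `(n_q, n_keys)` matrix grows quadratically with sequence length, which matters at 32k tokens.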

Before: OutOfMemory.

After: basic command (with --limit_hpu_graphs, --reuse_cache, --bucket_internal, --batch_size 4):

QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generation.py --model_name_or_path mistralai/Mixtral-8x7B-v0.1 --use_hpu_graphs --limit_hpu_graphs --use_kv_cache --reuse_cache --bucket_internal --bucket_size ${bucket_size} --max_new_tokens ${max_new_tokens} --bf16 --fp8 --batch_size 4 --max_input_tokens 32000
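The test cases below reuse this command and vary only --bucket_size and --max_new_tokens. A small, hypothetical helper (not part of run_generation.py or this PR) can build each invocation so the three runs are easy to reproduce. It only assembles the environment variable and argument list shown above; it does not launch anything.

```python
import shlex

# The (bucket_size, max_new_tokens) pairs reported in the test cases of this PR.
TEST_CASES = [(128, 128), (256, 512), (256, 700)]

def build_command(bucket_size, max_new_tokens, batch_size=4, max_input_tokens=32000):
    """Return (env, argv) for one run_generation.py invocation (hypothetical helper)."""
    env = {"QUANT_CONFIG": "./quantization_config/maxabs_quant_mixtral.json"}
    argv = [
        "python", "run_generation.py",
        "--model_name_or_path", "mistralai/Mixtral-8x7B-v0.1",
        "--use_hpu_graphs", "--limit_hpu_graphs",
        "--use_kv_cache", "--reuse_cache", "--bucket_internal",
        "--bucket_size", str(bucket_size),
        "--max_new_tokens", str(max_new_tokens),
        "--bf16", "--fp8",
        "--batch_size", str(batch_size),
        "--max_input_tokens", str(max_input_tokens),
    ]
    return env, argv

for bucket_size, max_new_tokens in TEST_CASES:
    env, argv = build_command(bucket_size, max_new_tokens)
    # Print a shell-ready line matching the command above.
    print(f"QUANT_CONFIG={env['QUANT_CONFIG']} {shlex.join(argv)}")
```

To actually launch a run, `argv` could be passed to `subprocess.run` with `QUANT_CONFIG` merged into `os.environ`.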

Test case 1: --bucket_size 128 --max_new_tokens 128 input 4: ('He got all',) output 1: ('He got all the way to the top of the mountain, but he couldn’t get over the last hurdle.\n\nThe 2018 World Cup is over for Cristiano Ronaldo.\n\nThe Portugal star was unable to help his team advance to the quarterfinals, as Uruguay defeated Portugal 2-1 in the round of 16 on Saturday.\n\nRonaldo, 33, has never won a World Cup. He’s never even made it to the semifinals.\n\nRonaldo has won the Champions League five times, the Ballon d’Or',)

Stats:
Throughput (including tokenization) = 8.158104216323297 tokens/second
Number of HPU graphs = 337
Memory allocated = 69.49 GB
Max memory allocated = 94.41 GB
Total memory available = 94.62 GB
Graph compilation duration = 464.618659114989 seconds

Test case 2: --bucket_size 256 --max_new_tokens 512 input 4: ('He got all',) output 1: ('He got all the way to the top of the mountain, but he couldn’t get over the last hurdle.\n\nThe 2018 World Cup is over for Cristiano Ronaldo.\n\nThe Portugal star was unable to help his team advance to the quarterfinals, as Uruguay defeated Portugal 2-1 in the round of 16 on Saturday.\n\nRonaldo, 33, has never won a World Cup. He’s never even made it to the semifinals.\n\nRonaldo has won the Champions League five times, the Ballon d’Or five times, the European Championship once, and the European Golden Shoe four times.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 85 goals in 154 appearances for Portugal. He’s scored 450 goals in 438 appearances for Real Madrid.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying.',)

Stats:
Throughput (including tokenization) = 27.221884640574576 tokens/second
Number of HPU graphs = 369
Memory allocated = 69.68 GB
Max memory allocated = 94.58 GB
Total memory available = 94.62 GB
Graph compilation duration = 547.4480734140379 seconds

Test case 3: --bucket_size 256 --max_new_tokens 700 input 4: ('He got all',) output 1: ('He got all the way to the top of the mountain, but he couldn’t get over the last hurdle.\n\nThe 2018 World Cup is over for Cristiano Ronaldo.\n\nThe Portugal star was unable to help his team advance to the quarterfinals, as Uruguay defeated Portugal 2-1 in the round of 16 on Saturday.\n\nRonaldo, 33, has never won a World Cup. He’s never even made it to the semifinals.\n\nRonaldo has won the Champions League five times, the Ballon d’Or five times, the European Championship once, and the European Golden Shoe four times.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 85 goals in 154 appearances for Portugal. He’s scored 450 goals in 438 appearances for Real Madrid.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. 
He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.',)

Stats:
Throughput (including tokenization) = 34.39802785860282 tokens/second
Number of HPU graphs = 401
Memory allocated = 69.78 GB
Max memory allocated = 94.6 GB
Total memory available = 94.62 GB
Graph compilation duration = 670.873160321964 seconds

Before submitting

[ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[ ] Did you make sure to update the documentation with your changes?
[ ] Did you write any new necessary tests?

Apr 18 '24 10:04 jychen21

This PR splits PR https://github.com/huggingface/optimum-habana/pull/836 into smaller pieces and is based on PR https://github.com/huggingface/optimum-habana/pull/901.

Apr 18 '24 10:04 jychen21

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Apr 18 '24 10:04 HuggingFaceDocBuilderDev

@jychen-habana, please test rope_scaling with Mixtral and update the results here.

Apr 29 '24 19:04 mandy-li


Run with rope_scaling (add the following to config.json): "rope_scaling": {"type":"linear","factor":2.0},
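Adding the entry by hand works; a minimal scripted alternative, assuming a local, writable copy of the model's config.json (the path and both function names are hypothetical), uses only the standard json module and writes the same key shape as the snippet above.

```python
import json
from pathlib import Path

def with_rope_scaling(config, scaling_type, factor):
    """Return a copy of a config dict with a rope_scaling entry set (hypothetical helper)."""
    updated = dict(config)
    # Same shape as used in this PR, e.g. {"type": "linear", "factor": 2.0}.
    updated["rope_scaling"] = {"type": scaling_type, "factor": factor}
    return updated

def add_rope_scaling(config_path, scaling_type, factor):
    """Rewrite a config.json file in place with the given rope_scaling entry."""
    path = Path(config_path)
    config = with_rope_scaling(json.loads(path.read_text()), scaling_type, factor)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For the two runs in this comment, the calls would be `add_rope_scaling("path/to/config.json", "linear", 2.0)` and `add_rope_scaling("path/to/config.json", "dynamic", 2.0)`, where the path points at whatever local model directory `--model_name_or_path` loads.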

Test case: --max_input_tokens 32000 --bucket_size 1024 --max_new_tokens 512 --batch_size 1 Input/outputs: input 1: ('DeepSpeed is a machine learning framework',) output 1: ('DeepSpeed is a machine learning framework that is developed by Microsoft to train large models with billions of parameters. It is a library that is built on top of PyTorcharm and is designed to train large models with billions of parameters. It is a library is built on top of PyTorcharm and is designed to train large models with billions of parameters. It is library is built on top Pycharm and designed to train large models with billions of parameters. It is library is built on Pycharm designed to train large models billions of parameters. It is library built on Pyarm to train large models billions parameters. It built Pyarm to train models billions. It arm train models. It train.\n\n\nDeepSpeed is a machine learning framework is developed by Microsoft to train large models with billions of parameters. It is a library is built on top PyTorarm is designed to train large models with billions parameters. It is library built on Pyarm is designed to train large models billions. It library is built arm to train billions. It is built to train.\n\n\nDeepSpeed is a machine framework is developed by Microsoft to train models billions. It is library is built Pyarm to train billions. It is built arm to train.\n\n\nDeep is machine framework developed Microsoft train billions. is library arm train.\n\n\nDeep is framework Microsoft billions.\n\n\nDeep is Microsoft\n\n\n...',)

Stats:
Throughput (including tokenization) = 18.225460417191343 tokens/second
Number of HPU graphs = 342
Memory allocated = 51.91 GB
Max memory allocated = 94.56 GB
Total memory available = 94.62 GB
Graph compilation duration = 314.0399896269664 seconds

Run with rope_scaling (add the following to config.json): "rope_scaling": {"type":"dynamic","factor":2.0},

Test case: --max_input_tokens 32000 --bucket_size 1024 --max_new_tokens 512 --batch_size 1 Input/outputs: input 1: ('DeepSpeed is a machine learning framework',) output 1: ('DeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks.\n\n## What is DeepSpeed?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks.\n\n## How does DeepSpeed work?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks.\n\n## What are the benefits of using DeepSpeed?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks.\n\n## How can I get started with DeepSpeed?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks.\n\n## What are the limitations of DeepSpeed?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks. However, there are some limitations to DeepSpeed.\n\nFirst, DeepSpeed is only compatible with certain types of models. It does not support all types of models, so you may need to use another framework if you want to train a model that is not supported by DeepSpeed.\n\nSecond, DeepSpeed is only compatible with certain types of hardware. 
It requires 8 GPUs to work properly, so you will need to have access to 8 GPUs in order to use DeepSpeed.\n\nThird, DeepSpeed is only compatible with certain types of software. It requires the use of certain libraries in order to work properly, so you will need to have these libraries installed in order to use DeepSpeed.\n\n## How does DeepSpeed compare to other machine learning frameworks?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It',)

Stats:
Throughput (including tokenization) = 18.225313735944003 tokens/second
Number of HPU graphs = 342
Memory allocated = 51.91 GB
Max memory allocated = 94.56 GB
Total memory available = 94.62 GB
Graph compilation duration = 310.87937180604786 seconds
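The two rope_scaling runs above differ only in "type", yet the linear run degraded into repetitive, garbled text on a short prompt while the dynamic run stayed coherent. One plausible reading, not measured in this PR, is sketched below in pure Python. It assumes the formulas used by Hugging Face transformers' linear and dynamic NTK rotary embeddings (not verified against optimum-habana's Mixtral path) and assumed Mixtral-8x7B values (head_dim 128, rope_theta 1e6, max_position_embeddings 32768). Linear scaling divides every position by the factor, while dynamic scaling changes nothing until the sequence exceeds max_position_embeddings.

```python
def rope_inv_freq(dim, base):
    # Standard RoPE inverse frequencies: base^(-2i/dim) for i = 0 .. dim/2 - 1.
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]

def linear_scaled_angles(position, dim, base, factor):
    # Linear scaling: every position index is divided by the factor,
    # regardless of how long the current sequence is.
    return [(position / factor) * f for f in rope_inv_freq(dim, base)]

def dynamic_ntk_base(seq_len, dim, base, max_pos, factor):
    # Dynamic NTK scaling: the base is only enlarged once the sequence
    # is longer than max_position_embeddings.
    if seq_len <= max_pos:
        return base
    return base * ((factor * seq_len / max_pos) - (factor - 1)) ** (dim / (dim - 2))

def dynamic_scaled_angles(position, seq_len, dim, base, max_pos, factor):
    new_base = dynamic_ntk_base(seq_len, dim, base, max_pos, factor)
    return [position * f for f in rope_inv_freq(dim, new_base)]

# Assumed Mixtral-8x7B values; FACTOR matches the "factor": 2.0 used above.
DIM, BASE, MAX_POS, FACTOR = 128, 1e6, 32768, 2.0

plain = [100 * f for f in rope_inv_freq(DIM, BASE)]
linear = linear_scaled_angles(100, DIM, BASE, FACTOR)
dynamic = dynamic_scaled_angles(100, 512, DIM, BASE, MAX_POS, FACTOR)
print(linear[1] / plain[1])  # 0.5: a short prompt still sees halved angles
print(dynamic == plain)      # True: below MAX_POS nothing changes
```

Under these assumptions, a short prompt under linear scaling sees rotation angles the model was not trained on, whereas dynamic scaling only adjusts the base for inputs longer than 32768 tokens.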

May 08 '24 05:05 jychen21

@regisss @libinta @mandy-li, please review and merge this PR. Thanks!

May 08 '24 09:05 jychen21

optimum-habana

Support Mixtral long sequences (32k) with batch size 4