optimum-habana
Optimized inference of Starcoder2 model
What does this PR do?
Works:
- Optimized Starcoder2 model inference.
- Validated bigcode/starcoder2-3b inference on a single card (a minimal usage sketch follows below).
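For context, a minimal sketch of what the optimized path looks like outside the example script, assuming a Gaudi machine with optimum-habana and habana_frameworks installed (`adapt_transformers_to_gaudi` is the entry point that swaps in the Gaudi-optimized model code; this is not the exact run_generation.py code path):

```python
# Minimal sketch (assumptions: Gaudi machine, optimum-habana and
# habana_frameworks installed; not the exact run_generation.py code path).
import torch
import habana_frameworks.torch.core  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Patch transformers classes with the Gaudi-optimized implementations.
adapt_transformers_to_gaudi()

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b", torch_dtype=torch.bfloat16
).eval().to("hpu")

inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```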
Validation results
static shape generation
python run_generation.py --model_name_or_path bigcode/starcoder2-3b --use_kv_cache --max_new_tokens 100 --bf16 --batch_size 4
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for deep learning.\n\n# How to build\n\n`python-shufflenet` is a python-only implementation, it depends on python libraries only for Python3.\n\n## Build with gcc and g++\n\nIf you are in a UNIX-like environment with gcc and g++ installed, the `Makefile` in `build/Makefile.<os>.gcc` is recommended.\n\n```\ncd build && make -f Makefile.Darwin.gcc\n```\n',)
input 2: ('He is working on',)
output 1: ("He is working on your team and can't wait to meet your success',\n 'He should be flying to the moon with us',\n 'If you do these things you will get paid twice as much as he does',\n 'He and I are having a great time',\n 'He makes so many great people like us',\n 'He thinks that success is being born of hard work and dedication',\n 'He is so excited about working at MindTouch',\n 'He likes the",)
input 3: ('He has a',)
output 3: ('He has a very beautiful hand.") }\n }\n child {\n text(body.text) {\n attr(\'data-testid\', \'body\')\n }\n }\n }\n }\n\n @Test\n fun `verify content`() {\n val expectedBodyText = buildHtml(root)\n val actualBodyText = renderBody(root)\n assertThat(expectedBodyText).isEqualTo(actualBodyText)\n }\n\n @Test\n fun `verify attributes`() {\n subject.verifyAttributes {',)
input 4: ('He got all',)
output 1: ("He got all their money and all the bribes they'd been buying with him and fucked him like the man he wasn't before. This was his first time with, but he was so close to the end. He knew just what to do; he was the one who had fucked his asshole, and the asshole he had fucked was.\n\n I've known this man for ages. He likes to make stuff look good. He never took",)
Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 112.6257038793236 tokens/second
Number of HPU graphs = 6
Memory allocated = 6.17 GB
Max memory allocated = 6.27 GB
Total memory available = 94.62 GB
Graph compilation duration = 22.754922633990645 seconds
-------------------------------------------------------------------------------------------------------------
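For context, a tokens/second figure like the one above is total generated tokens divided by wall-clock time; a rough sketch with a hypothetical duration (run_generation.py's exact accounting, e.g. warmup handling, may differ):

```python
# Rough sketch of the throughput arithmetic (duration_s is hypothetical;
# run_generation.py's exact bookkeeping may differ).
batch_size = 4
max_new_tokens = 100
duration_s = 3.55  # hypothetical wall-clock time, including tokenization

throughput = batch_size * max_new_tokens / duration_s
print(f"Throughput (including tokenization) = {throughput} tokens/second")
```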
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you make sure to update the documentation with your changes?
- [x] Did you write any new necessary tests?
Hi, can you please review this PR?
@XinyuYe-Intel, please 1) provide the performance and output comparison between Gaudi2 and A100, and 2) add a CI test. Thanks.
Hi Xinyu @XinyuYe-Intel, I have added a code patch https://github.com/XinyuYe-Intel/optimum-habana/pull/1 for this Starcoder2 PR, please review. The changes:
- Replace `self._get_generation_mode` with `generation_config.get_generation_mode`, required for transformers >= 4.39.0 (see the sketch below).
- Keep the transformers 4.38.2 `StoppingCriteriaList`, because some errors occur with the changes in transformers >= 4.39.0.
- Remove the SDPA/flash attention code and force eager attention, because transformers >= 4.39.0 uses SDPA by default.
- Add a CI test and a Starcoder2 entry to the model index in the docs.
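The first bullet reflects a real transformers 4.39 API move: generation-mode detection went from the private `GenerationMixin._get_generation_mode` helper to a public method on `GenerationConfig`. A minimal sketch of the change:

```python
# Sketch of the transformers >= 4.39 API change: generation-mode detection
# moved from a private GenerationMixin helper onto GenerationConfig.
from transformers import GenerationConfig

generation_config = GenerationConfig(do_sample=False, num_beams=1)

# before (transformers < 4.39, inside GenerationMixin):
#   mode = self._get_generation_mode(generation_config, assistant_model)
# after (transformers >= 4.39):
mode = generation_config.get_generation_mode()
print(mode)  # greedy search for do_sample=False, num_beams=1
```

Forcing eager attention, as in the third bullet, corresponds to passing `attn_implementation="eager"` to `from_pretrained` instead of relying on the SDPA default.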
Hi @yao-matrix @libinta, here is the comparison of inference between Gaudi2 and A100.
Inference/generation performance (Gaudi2 and A100)
single card Gaudi2
python run_generation.py --model_name_or_path bigcode/starcoder2-3b --use_kv_cache --max_new_tokens 100 --bf16 --batch_size 1
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for deep learning.\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#',)
Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 33.99791086852853 tokens/second
Number of HPU graphs = 5
Memory allocated = 6.17 GB
Max memory allocated = 6.19 GB
Total memory available = 94.62 GB
Graph compilation duration = 16.111402586102486 seconds
-------------------------------------------------------------------------------------------------------------
With `--use_hpu_graphs`:
python run_generation.py --model_name_or_path bigcode/starcoder2-3b --use_kv_cache --max_new_tokens 100 --bf16 --batch_size 1 --use_hpu_graphs
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for deep learning.\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#',)
Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 234.2649120507936 tokens/second
Number of HPU graphs = 12
Memory allocated = 6.19 GB
Max memory allocated = 6.2 GB
Total memory available = 94.62 GB
Graph compilation duration = 8.577247043140233 seconds
--------------------------------------------------------------------------------------------------------------
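For reference, the `--use_hpu_graphs` flag roughly corresponds to wrapping the model so decode steps replay a captured graph instead of re-launching ops one by one; a sketch, assuming `habana_frameworks` exposes `wrap_in_hpu_graph` as in recent SynapseAI releases:

```python
# Hedged sketch of what --use_hpu_graphs enables (assumption: Gaudi machine
# where habana_frameworks exposes wrap_in_hpu_graph).
import torch
import habana_frameworks.torch.core  # registers the "hpu" device
from habana_frameworks.torch.hpu import wrap_in_hpu_graph
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b", torch_dtype=torch.bfloat16
).eval().to("hpu")

# Capture forward passes as HPU graphs; later calls with the same (static)
# shapes replay the recorded graph rather than re-launching each op.
model = wrap_in_hpu_graph(model)
```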
single card A100-80G
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for deep learning.\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#',)
Stats:
----------------------------------------------------------------------
Throughput (including tokenization) = 49.869579574434376 tokens/second
----------------------------------------------------------------------
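The A100 command isn't shown in the thread; the baseline was presumably a stock transformers generation loop in bf16 on CUDA, along these lines (a sketch, not the exact script used):

```python
# Hypothetical A100 baseline sketch (the exact script isn't shown here):
# stock transformers generation in bf16 on CUDA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b", torch_dtype=torch.bfloat16
).to("cuda").eval()

inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```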
Hi @yao-matrix @libinta, here is the comparison of training between Gaudi2 and A100.
Validated training (Gaudi2 and A100):
- bigcode/starcoder2-3b, single card Gaudi2, LoRA
- bigcode/starcoder2-7b, single card Gaudi2, LoRA
- bigcode/starcoder2-3b, single card A100-80G, LoRA
Note: multi-card training also works well on Gaudi2 (a hedged example command follows below).
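The exact training commands aren't posted here (they're requested later in the thread); a hedged sketch of the single-card Gaudi2 LoRA run for bigcode/starcoder2-3b, modeled on the run_lora_clm.py invocation shared further down (hyperparameters are illustrative, not necessarily the ones used):
python run_lora_clm.py --model_name_or_path bigcode/starcoder2-3b --dataset_name timdettmers/openassistant-guanaco --bf16 True --output_dir ./model_lora_starcoder2_3b --num_train_epochs 2 --per_device_train_batch_size 2 --gradient_accumulation_steps 4 --learning_rate 1e-4 --logging_steps 1 --do_train --use_habana --use_lazy_mode --lora_target_modules q_proj o_proj k_proj v_proj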
Hi @libinta, can you help review this PR? Thanks!
What transformers version did you use when running the tests for this PR? I see Starcoder2 only in transformers 4.39, and Optimum Habana's pinned transformers version is lower than that for now.
Yes, Starcoder2 only exists in transformers >= 4.39, so we used transformers 4.39.0 and the tests run well, as shown below.
Updated the branch to main (with transformers 4.40).
python run_generation.py --model_name_or_path bigcode/starcoder2-3b --use_kv_cache --max_new_tokens 100 --bf16 --batch_size 4
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for deep learning.\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#',)
input 2: ('He is working on',)
output 2: ('He is working on a new project.\n\nHe has a lot of work to do, but he is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\n',)
input 3: ('He has a',)
output 3: ('He has a 100% chance of winning the lottery.\n\nThe probability of him winning the lottery is 1/100.\n\nThe probability of him not winning the lottery is 99/100.\n\nThe probability of him winning the lottery is 1/100.\n\nThe probability of him not winning the lottery is 99/100.\n\nThe probability of him win',)
input 4: ('He got all',)
output 4: ('He got all the way to the end of the line.\n\nThe first thing I did was to check the documentation for the `get_current_line()` function.\n\n```\nget_current_line()\n\n```\n\n> \n> Returns the current line number.\n> \n> \n> \n\nSo I tried to use it.\n\n```\nprint(get_current_line())\n\n```\n\nAnd it printed `0`.\n\nI',)
Stats:
Throughput (including tokenization) = 111.32884028982812 tokens/second
Number of HPU graphs = 5
Memory allocated = 6.17 GB
Max memory allocated = 6.26 GB
Total memory available = 94.62 GB
Graph compilation duration = 17.493904751027003 seconds
@lkk12014402 Please share the LoRA commands.
I am getting an error with 1.16 and this PR for fine-tuning.
python run_lora_clm.py --model_name_or_path bigcode/starcoder2-15b-instruct-v0.1 --dataset_name timdettmers/openassistant-guanaco --bf16 True --output_dir ./model_lora_starcoder2_15b --num_train_epochs 2 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 1e-4 --logging_steps 1 --dataset_concatenation --do_train --use_habana --use_lazy_mode --throughput_warmup_steps 0 --lora_target_modules q_proj o_proj k_proj v_proj gate_proj up_proj down_proj
Error: ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
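For anyone hitting the same error before the tokenizer config fix mentioned below landed, the workaround the message itself suggests is to assign a pad token, e.g.:

```python
# Workaround sketch for the missing pad_token (unnecessary once the tokenizer
# config of bigcode/starcoder2-15b-instruct-v0.1 was fixed): reuse EOS as padding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b-instruct-v0.1")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```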
@anishagartia It's probably an error in the tokenizer config of bigcode/starcoder2-15b-instruct-v0.1. Let me check that internally and I'll follow up with you.
@anishagartia The issue should be solved now, can you try again and let me know if it works on your side?
Yes, it works now, thank you.