
Optimized inference of Starcoder2 model

XinyuYe-Intel opened this pull request 11 months ago • 8 comments

What does this PR do?

Work done

  1. Optimized Starcoder2 model inference.
  2. Validated bigcode/starcoder2-3b inference on a single card (a minimal usage sketch follows this list).
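
For reference, the end-to-end flow validated here looks roughly like the sketch below (a minimal sketch, assuming a Gaudi machine with optimum-habana and the Habana PyTorch bridge installed; run_generation.py layers further optimizations such as HPU graphs and static shapes on top of this):

import torch
import habana_frameworks.torch.core  # loads the HPU backend
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Patch transformers with the Gaudi-optimized model implementations.
adapt_transformers_to_gaudi()

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b", torch_dtype=torch.bfloat16
).to("hpu")

inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt").to("hpu")
# use_cache=True corresponds to the --use_kv_cache flag of run_generation.py.
outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))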

Validation results

static shape generation

python run_generation.py --model_name_or_path bigcode/starcoder2-3b --use_kv_cache --max_new_tokens 100 --bf16 --batch_size 4
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for deep learning.\n\n# How to build\n\n`python-shufflenet` is a python-only implementation, it depends on python libraries only for Python3.\n\n## Build with gcc and g++\n\nIf you are in a UNIX-like environment with gcc and g++ installed, the `Makefile` in `build/Makefile.<os>.gcc` is recommended.\n\n```\ncd build && make -f Makefile.Darwin.gcc\n```\n',)

input 2: ('He is working on',)
output 1: ("He is working on your team and can't wait to meet your success',\n      'He should be flying to the moon with us',\n      'If you do these things you will get paid twice as much as he does',\n      'He and I are having a great time',\n      'He makes so many great people like us',\n      'He thinks that success is being born of hard work and dedication',\n      'He is so excited about working at MindTouch',\n      'He likes the",)

input 3: ('He has a',)
output 1: ('He has a very beautiful hand.") }\n            }\n            child {\n                text(body.text) {\n                    attr(\'data-testid\', \'body\')\n                }\n            }\n        }\n    }\n\n    @Test\n    fun `verify content`() {\n        val expectedBodyText = buildHtml(root)\n        val actualBodyText = renderBody(root)\n        assertThat(expectedBodyText).isEqualTo(actualBodyText)\n    }\n\n    @Test\n    fun `verify attributes`() {\n        subject.verifyAttributes {',)

input 4: ('He got all',)
output 1: ("He got all their money and all the bribes they'd been buying with him and fucked him like the man he wasn't before. This was his first time with, but he was so close to the end. He knew just what to do; he was the one who had fucked his asshole, and the asshole he had fucked was.\n\n I've known this man for ages. He likes to make stuff look good. He never took",)


Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 112.6257038793236 tokens/second
Number of HPU graphs                = 6
Memory allocated                    = 6.17 GB
Max memory allocated                = 6.27 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 22.754922633990645 seconds
-------------------------------------------------------------------------------------------------------------                                

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you make sure to update the documentation with your changes?
  • [x] Did you write any new necessary tests?

XinyuYe-Intel avatar Mar 22 '24 03:03 XinyuYe-Intel

Hi, can you please review this PR?

XinyuYe-Intel avatar Mar 29 '24 02:03 XinyuYe-Intel

@XinyuYe-Intel, please 1) provide a performance and output comparison between Gaudi2 and A100, and 2) add a CI test. Thanks.

yao-matrix avatar Apr 19 '24 01:04 yao-matrix

Hi Xinyu @XinyuYe-Intel, I have added a code patch (https://github.com/XinyuYe-Intel/optimum-habana/pull/1) for this Starcoder2 PR, please review. The changes (the first one is sketched after this list):

  • replaced self._get_generation_mode with generation_config.get_generation_mode, since transformers>=4.39.0 moved this logic to GenerationConfig
  • kept the transformers>=4.38.2 StoppingCriteriaList, because some errors occur with the changes in transformers>=4.39.0
  • removed the SDPA/flash attention code and forced eager attention, because transformers>=4.39.0 uses SDPA by default
  • added a CI test and a Starcoder2 entry to the model index in the docs
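
A minimal sketch of the generation-mode change (the real patch has more context around the call sites, e.g. the self and assistant_model arguments):

from transformers import GenerationConfig

generation_config = GenerationConfig(do_sample=False, num_beams=1)

# transformers < 4.39.0 used a private helper on the model:
# generation_mode = self._get_generation_mode(generation_config, assistant_model)

# transformers >= 4.39.0 exposes it publicly on GenerationConfig instead:
generation_mode = generation_config.get_generation_mode()
print(generation_mode)  # GenerationMode.GREEDY_SEARCH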

lkk12014402 avatar Apr 20 '24 12:04 lkk12014402

Hi @yao-matrix @libinta, here is the comparison of inference (Gaudi2 and A100).

Inference/generation performance (Gaudi2 and A100)

single card Gaudi2

python run_generation.py --model_name_or_path bigcode/starcoder2-3b --use_kv_cache --max_new_tokens 100 --bf16 --batch_size 1

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for deep learning.\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#',)


Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 33.99791086852853 tokens/second
Number of HPU graphs                = 5
Memory allocated                    = 6.17 GB
Max memory allocated                = 6.19 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 16.111402586102486 seconds
-------------------------------------------------------------------------------------------------------------

With --use_hpu_graph:

python run_generation.py --model_name_or_path bigcode/starcoder2-3b --use_kv_cache --max_new_tokens 100 --bf16 --batch_size 1 --use_hpu_graph

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for deep learning.\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#',)


Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 234.2649120507936 tokens/second
Number of HPU graphs                = 12
Memory allocated                    = 6.19 GB
Max memory allocated                = 6.2 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 8.577247043140233 seconds
--------------------------------------------------------------------------------------------------------------

single card A100-80G

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for deep learning.\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#',)


Stats:
----------------------------------------------------------------------
Throughput (including tokenization) = 49.869579574434376 tokens/second
----------------------------------------------------------------------

lkk12014402 avatar Apr 20 '24 12:04 lkk12014402

Hi @yao-matrix @libinta, here is the comparison of training (Gaudi2 and A100).

Validated training (Gaudi2 and A100)

  • bigcode/starcoder2-3b, single card Gaudi2, LoRA

  • bigcode/starcoder2-7b, single card Gaudi2, LoRA

  • bigcode/starcoder2-3b, single card A100-80G, LoRA

Note: multi-card training also works well on Gaudi2; a hypothetical launch command is sketched below.
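
For example, a multi-card LoRA run could go through the gaudi_spawn.py helper from the optimum-habana examples (a command sketch with hypothetical hyperparameters, not the exact command used for these runs):

python ../gaudi_spawn.py --world_size 8 --use_mpi run_lora_clm.py \
    --model_name_or_path bigcode/starcoder2-3b \
    --dataset_name timdettmers/openassistant-guanaco \
    --bf16 True \
    --output_dir ./model_lora_starcoder2_3b \
    --per_device_train_batch_size 2 \
    --do_train --use_habana --use_lazy_mode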

lkk12014402 avatar Apr 20 '24 12:04 lkk12014402

Hi @libinta, can you help review this PR? Thanks!

lkk12014402 avatar Apr 20 '24 13:04 lkk12014402

What is the Transformers version you had when you ran the tests for this PR? I see Starcoder2 only in Transformers 4.39, and Optimum Habana's Transformers dependency is lower than that for now.

Yes, Starcoder2 only exists in Transformers 4.39 and above, so we used Transformers 4.39.0 and the tests ran well.
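
A quick way to confirm that the installed Transformers version supports Starcoder2 (a minimal sketch):

import transformers
print(transformers.__version__)  # Starcoder2 requires >= 4.39.0

# The model class only exists from 4.39.0 onward, so this import
# fails with an ImportError on older versions:
from transformers import Starcoder2ForCausalLM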

XinyuYe-Intel avatar Apr 26 '24 01:04 XinyuYe-Intel

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Updated the branch to main (with Transformers 4.40):

python run_generation.py --model_name_or_path bigcode/starcoder2-3b --use_kv_cache --max_new_tokens 100 --bf16 --batch_size 4
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for deep learning.\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#\n#',)

input 2: ('He is working on',)
output 1: ('He is working on a new project.\n\nHe has a lot of work to do, but he is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\nHe is very busy.\n\n',)

input 3: ('He has a',)
output 1: ('He has a 100% chance of winning the lottery.\n\nThe probability of him winning the lottery is 1/100.\n\nThe probability of him not winning the lottery is 99/100.\n\nThe probability of him winning the lottery is 1/100.\n\nThe probability of him not winning the lottery is 99/100.\n\nThe probability of him win',)

input 4: ('He got all',)
output 1: ('He got all the way to the end of the line.\n\nThe first thing I did was to check the documentation for the `get_current_line()` function.\n\n```\nget_current_line()\n\n```\n\n> \n> Returns the current line number.\n> \n> \n> \n\nSo I tried to use it.\n\n```\nprint(get_current_line())\n\n```\n\nAnd it printed `0`.\n\nI',)

Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 111.32884028982812 tokens/second
Number of HPU graphs                = 5
Memory allocated                    = 6.17 GB
Max memory allocated                = 6.26 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 17.493904751027003 seconds
-------------------------------------------------------------------------------------------------------------

ssarkar2 avatar May 31 '24 20:05 ssarkar2

@lkk12014402 Please share the LoRA commands.

I am getting an error with 1.16 and this PR when fine-tuning.

python run_lora_clm.py --model_name_or_path bigcode/starcoder2-15b-instruct-v0.1 --dataset_name timdettmers/openassistant-guanaco --bf16 True --output_dir ./model_lora_starcoder2_15b --num_train_epochs 2 --per_device_train_batch_size 2  --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 1e-4 --logging_steps 1 --dataset_concatenation --do_train --use_habana --use_lazy_mode     --throughput_warmup_steps 0   --lora_target_modules q_proj o_proj k_proj v_proj gate_proj up_proj down_proj

Error: ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
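
For reference, a common workaround for this error is to assign a pad token explicitly before tokenizing (a minimal sketch; the upstream tokenizer-config fix mentioned below makes it unnecessary here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b-instruct-v0.1")
if tokenizer.pad_token is None:
    # Reuse the EOS token for padding, as the error message suggests.
    tokenizer.pad_token = tokenizer.eos_token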

anishagartia avatar Jun 04 '24 22:06 anishagartia

@anishagartia It's probably an error in the tokenizer config of bigcode/starcoder2-15b-instruct-v0.1. Let me check that internally and I'll follow up with you.

regisss avatar Jun 05 '24 17:06 regisss

@anishagartia The issue should be solved now, can you try again and let me know if it works on your side?

regisss avatar Jun 05 '24 19:06 regisss

Yes, it works now, thank you.

anishagartia avatar Jun 06 '24 16:06 anishagartia