
[enhancement] support llama

Open void-main opened this issue 1 year ago • 28 comments

Implement LLaMA as requested in issue #506.

Steps to use

First, convert the llama-7b-hf weights from Hugging Face with huggingface_llama_convert.py: python3 huggingface_llama_convert.py -saved_dir=/path/to/export/folder/ -in_file=/path/to/llama-7b-hf -infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b

Next, compile and run llama_example.

Test case

start_ids.csv: [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973] out: [0,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366]
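
To sanity-check these IDs, they can be decoded back to text with the Hugging Face tokenizer. A minimal sketch, assuming the transformers package is installed and /path/to/llama-7b-hf is the same checkpoint that was passed to huggingface_llama_convert.py:

# Decode the FT start/output token IDs with the Hugging Face LLaMA tokenizer.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/path/to/llama-7b-hf")

start_ids = [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973,
             1815, 366, 5193, 304, 592, 29973]
print(tokenizer.decode(start_ids))  # should print the original prompt text

# Decoding the full `out` list above shows the prompt text repeating,
# matching the token IDs listed in the test case.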

void-main avatar Apr 24 '23 07:04 void-main

This LLaMA implementation is very meaningful. Have you tested its performance? How fast is it compared with the vanilla transformers API?

syslot avatar Apr 25 '23 16:04 syslot

This LLaMA implementation is very meaningful. Have you tested its performance? How fast is it compared with the vanilla transformers API?

I've been super busy lately and don't quite have the time for a performance comparison; hopefully someone will do the favor and compare FT with the transformers API. :-)

void-main avatar Apr 27 '23 01:04 void-main

Does this implement int8 (or even 4bit) by any chance?

152334H avatar Apr 28 '23 07:04 152334H

Some updates:

  • supported bf16
  • supported Triton decoupled mode (see the client sketch below)
  • verified that LLaMA 65B is working
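
For the decoupled (streaming) mode, here is a minimal Triton gRPC client sketch. The model name ("fastertransformer"), tensor names, and dtypes are assumptions and must match your config.pbtxt; this is an illustration, not the official client.

# Sketch: stream results from a decoupled-mode FasterTransformer model on Triton.
import numpy as np
import tritonclient.grpc as grpcclient

def on_result(result, error):
    # Called once per streamed response in decoupled mode.
    if error is not None:
        print("error:", error)
    else:
        print("output_ids:", result.as_numpy("output_ids"))

client = grpcclient.InferenceServerClient("localhost:8001")

input_ids = np.array([[0, 18637, 29892, 526, 366]], dtype=np.uint32)
inputs = [
    grpcclient.InferInput("input_ids", input_ids.shape, "UINT32"),
    grpcclient.InferInput("input_lengths", [1, 1], "UINT32"),
    grpcclient.InferInput("request_output_len", [1, 1], "UINT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(np.array([[input_ids.shape[1]]], dtype=np.uint32))
inputs[2].set_data_from_numpy(np.array([[64]], dtype=np.uint32))

client.start_stream(callback=on_result)
client.async_stream_infer("fastertransformer", inputs)
client.stop_stream()  # waits for the stream to drain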

void-main avatar Apr 30 '23 00:04 void-main

Implement LLaMA as requested in issue #506.

Steps to use

First, convert the llama-7b-hf weights from Hugging Face with huggingface_llama_convert.py: python3 huggingface_llama_convert.py -saved_dir=/path/to/export/folder/ -in_file=/path/to/llama-7b-hf -infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b

Next, compile and run llama_example.

Test case

start_ids.csv: [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973] out: [0,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366]

What are the parameters for kernel auto-tuning for the LLaMA model?

pineking avatar May 01 '23 03:05 pineking

Does this implement int8 (or even 4bit) by any chance?

FasterTransformer doesn't seem to support int4 at all right now. I would be interested in helping with int8 though, that should enable the 65B model to run tensor-parallel on my 2x A6000 GPUs.
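
For rough context on why INT8 matters here, a back-of-the-envelope sketch (weights only; activations and the KV cache need additional memory):

# Approximate weight memory for a 65B-parameter model.
params = 65e9
fp16_gb = params * 2 / 1e9   # ~130 GB, more than 2 x 48 GB A6000s hold
int8_gb = params * 1 / 1e9   # ~65 GB, fits across two 48 GB cards with headroom
print(f"fp16: {fp16_gb:.0f} GB, int8: {int8_gb:.0f} GB")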

atyshka avatar May 01 '23 13:05 atyshka

Does this implement int8 (or even 4bit) by any chance?

FasterTransformer doesn't seem to support int4 at all right now. I would be interested in helping with int8 though, that should enable the 65B model to run tensor-parallel on my 2x A6000 GPUs.

+1 happy to contribute to this

happytree09 avatar May 01 '23 19:05 happytree09

Hi @byshiue, now that we have made a lot of progress and verified the implementation on many models, I wonder if it is possible to get this PR reviewed / merged?

void-main avatar May 02 '23 01:05 void-main

I noticed that llama_example.cpp generates correct outputs in FP16, while triton does not. Does anyone know why?

michaelroyzen avatar May 05 '23 00:05 michaelroyzen

I noticed that llama_example.cpp generates correct outputs in FP16, while triton does not. Does anyone know why?

@michaelroyzen looks like the root cause is what @yinghai pointed out. Merging this PR fixes the issue for decoupled mode.

void-main avatar May 06 '23 08:05 void-main

@byshiue Could this get merged, please?

michaelroyzen avatar May 10 '23 03:05 michaelroyzen

Implement LLaMA as requested in issue #506. Next, compile and run llama_example.

Conversion works fine, thanks! Is there any guide on how to compile and run the files in llama_example? A simple make fails for me; it looks like I need to do some additional steps.

RomaA2000 avatar May 15 '23 13:05 RomaA2000

@void-main Have you compared with ggml's llama.cpp with cuBLAS support?

lucasjinreal avatar May 25 '23 08:05 lucasjinreal

Hi. I've recently tested this implementation on blip2_vicuna_instruct. It uses the vit_qformer embedding as a prefix_soft_embedding, which is fed into Vicuna along with the prompt's token_ids.

According to my test results: when testing only vicuna-13b, FT outputs text of the same quality as Hugging Face's. However, when token_ids are fed along with the prefix_soft_embedding, a noticeable quality decrease occurs.

For example, given a reference image, prompt: Describe the environment in which the product in the middle of the image is located

pytorch output:

. The product in the middle of this image is located within a refrigerator, surrounded by various fruits and vegetables on both sides as well

FT output:

. The refrigerator is open and filled with food.
The refrigerator is open and filled with food.

Does anyone have experience using FasterTransformer's prefix soft prompt feature? What might cause this issue? Could it be a usage mistake? I need some hints to debug it.

Thanks in advance!

[EDITED]: issue solved

handoku avatar Jun 09 '23 11:06 handoku

@void-main Hi, I'm also in Beijing and I'm a developer working on AI inference. Could I have your WeChat?

sleepwalker2017 avatar Jun 19 '23 09:06 sleepwalker2017

@void-main Hi, I'm also in Beijing and I'm a developer working on AI inference. Could I have your WeChat?

Sure, try sending me an email. :-)

void-main avatar Jun 20 '23 04:06 void-main

I found that the rotary embedding results are different between FT and Hugging Face. Has anyone met similar problems?

frankxyy avatar Jun 27 '23 02:06 frankxyy

@void-main Hello, I found a bug: after many (thousands of) batched inferences (batch size 20), some batches produce random output. But if the Triton service is restarted, inference works normally again.

With batch size 5, I haven't seen it yet.

The prompts mix Chinese and English. Some example answers:

该��ate-to-p>\n\n\n\n\n\n\n\n\n\n\n\n\nIt's a\n\nIt's a new-est\nIt in the\nIt's at all about\nIt's at the\nIt's at the\nIt's at the\nIt's\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's 

该宏在况况冲况��'s 不, 不, 它的 它的 ��, ��\n\n\n\n\n\n\n\n\n\n\n\nJupit was not you be ��, ��\n/b\n/and ��\n/��\n/��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/��\n/\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��

Compiled based on the FasterTransformer backend.

Device: V100 / A100, 4 GPUs. Model: Vicuna-13B-v1.1. Parameters: top_k = 1, output_token_len = 500, batch_size = 20.

UnknownSwordsman avatar Jun 28 '23 07:06 UnknownSwordsman

Another problem: when batch inference is used, the same prompt generates different results across the batch.

Parameters: top_k = 1, random_seed = 1, output_len = 500. Device: T4 / A100, 4 GPUs, via Triton server.

prompt = "写一篇关于爱情的故事" ("Write a story about love")

answer:

['text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以在任何情况下帮助他们克服困难。\n',
   'index': 0},
  {'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以在任何情况下帮助他们克服困难。\n',
   'index': 1},
  {'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以战胜一切困难。\n\n这是一个关于�',
   'index': 2},
  {'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以在任何情况下帮助他们克服困难。\n',
   'index': 3},
  {'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以战胜一切困难。\n\n这是一个关于�', 'index': 4}]

UnknownSwordsman avatar Jun 28 '23 07:06 UnknownSwordsman

It seems that torch.cos() and the C cos function generate slightly different results, which leads to different rotary embedding results. Does anyone have an idea for a solution?

You are right, the model should use the basic type of rotary.
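
For anyone curious about the size of that discrepancy, here is a minimal sketch. It is an illustration only, assuming the standard rotary formulation with base 10000, and compares fp32 torch.cos against double-precision math.cos:

# Compare fp32 torch.cos with double-precision math.cos for rotary angles.
import math
import torch

dim, base, pos = 128, 10000.0, 2047   # head dim, rotary base, a late position
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
angles = pos * inv_freq

cos_fp32 = torch.cos(angles)          # single precision
cos_fp64 = torch.tensor([math.cos(a) for a in angles.double().tolist()],
                        dtype=torch.float64)

print((cos_fp32.double() - cos_fp64).abs().max())  # small but nonzero difference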

prnake avatar Jun 28 '23 09:06 prnake

First of all, thanks for your FT LLaMA implementation, @void-main. I pushed a PR to support INT8 and shared context. Can anyone help me check it?

CN-COTER avatar Jun 29 '23 12:06 CN-COTER

Hi @CN-COTER, thanks for the contribution! Really appreciate it! I've checked your code and started a review; could you please take a look? 🎉

void-main avatar Jun 30 '23 02:06 void-main

Implement LLaMA as requested in issue #506.

Steps to use

First, convert the llama-7b-hf weights from Hugging Face with huggingface_llama_convert.py: python3 huggingface_llama_convert.py -saved_dir=/path/to/export/folder/ -in_file=/path/to/llama-7b-hf -infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b

Next, compile and run llama_example.

Test case

start_ids.csv: [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973] out: [0,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366]

My start_ids.csv: [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973] out: 0 18637 29892 526 366 1136 455 2470 29973 1815 366 5193 304 592 29973 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 2 2991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991

decode out: Hey, are you consciours? Can you talk to me?olgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolg

It looks like there's a problem with the output and the decoded output. Please give me some suggestions.

double-vin avatar Jul 25 '23 09:07 double-vin

@void-main Hello, I found a bug: after many (thousands of) batched inferences (batch size 20), some batches produce random output. But if the Triton service is restarted, inference works normally again.

With batch size 5, I haven't seen it yet.

The prompts mix Chinese and English. Some example answers:

该��ate-to-p>\n\n\n\n\n\n\n\n\n\n\n\n\nIt's a\n\nIt's a new-est\nIt in the\nIt's at all about\nIt's at the\nIt's at the\nIt's at the\nIt's\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's 

该宏在况况冲况��'s 不, 不, 它的 它的 ��, ��\n\n\n\n\n\n\n\n\n\n\n\nJupit was not you be ��, ��\n/b\n/and ��\n/��\n/��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/��\n/\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��

Compiled based on the FasterTransformer backend.

Device: V100 / A100, 4 GPUs. Model: Vicuna-13B-v1.1. Parameters: top_k = 1, output_token_len = 500, batch_size = 20.

Same issue. Have you found a solution?

realgump avatar Aug 01 '23 02:08 realgump

Implement LLaMA as requested in issue #506.

Steps to use

First, convert the llama-7b-hf weights from Hugging Face with huggingface_llama_convert.py: python3 huggingface_llama_convert.py -saved_dir=/path/to/export/folder/ -in_file=/path/to/llama-7b-hf -infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b Next, compile and run llama_example.

Test case

start_ids.csv: [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973] out: [0,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366]

My start_ids.csv: [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973] out: 0 18637 29892 526 366 1136 455 2470 29973 1815 366 5193 304 592 29973 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 2 2991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991

decode out: Hey, are you consciours? Can you talk to me?olgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolg

It looks like there's a problem with the output and the decoded output. Please give me some suggestions.

@void-main Please give some suggestions. Thank you!

double-vin avatar Aug 02 '23 01:08 double-vin

@void-main Hello, I found a bug: after many (thousands of) batched inferences (batch size 20), some batches produce random output. But if the Triton service is restarted, inference works normally again. With batch size 5, I haven't seen it yet. The prompts mix Chinese and English. Some example answers:

该��ate-to-p>\n\n\n\n\n\n\n\n\n\n\n\n\nIt's a\n\nIt's a new-est\nIt in the\nIt's at all about\nIt's at the\nIt's at the\nIt's at the\nIt's\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's 

该宏在况况冲况��'s 不, 不, 它的 它的 ��, ��\n\n\n\n\n\n\n\n\n\n\n\nJupit was not you be ��, ��\n/b\n/and ��\n/��\n/��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/��\n/\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��

Compiled based on the FasterTransformer backend. Device: V100 / A100, 4 GPUs. Model: Vicuna-13B-v1.1. Parameters: top_k = 1, output_token_len = 500, batch_size = 20.

Same issue. Have you found a solution?

+1, but I haven't found reproduction steps yet.

valtab avatar Aug 07 '23 14:08 valtab

Will the Llama 2 70B architecture be supported in the future? @void-main Thanks

jcao-ai avatar Aug 08 '23 02:08 jcao-ai