FasterTransformer
[enhancement] support llama
Implement LLaMA as requested in issue #506.
Steps to use
First, convert the llama-7b-hf weights from Hugging Face with huggingface_llama_convert.py:
python3 huggingface_llama_convert.py -saved_dir=/path/to/export/folder/ -in_file=/path/to/llama-7b-hf -infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b
Next, compile and run llama_example.
Test case
start_ids.csv: [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973]
out: [0,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366]
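As a quick sanity check (my own sketch, not part of the PR), the ids above can be mapped back and forth with the Hugging Face tokenizer, assuming a local llama-7b-hf checkout and a transformers release that ships LlamaTokenizer:

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/path/to/llama-7b-hf")

# Decode the sample ids above to see the prompt text.
start_ids = [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973,
             1815, 366, 5193, 304, 592, 29973]
print(tokenizer.decode(start_ids, skip_special_tokens=True))
# "Hey, are you consciours? Can you talk to me?"

# To build your own start_ids.csv, encode a prompt and join the ids with commas.
prompt = "Hey, are you consciours? Can you talk to me?"
print(",".join(str(i) for i in tokenizer.encode(prompt)))
```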
This LLaMA implementation is very meaningful. Have you tested its performance? How fast is it compared with the vanilla transformers API?
I've been super busy lately and don't quite have the time for a performance comparison; hopefully someone will do the favor and compare FT with the transformers API. :-)
Does this implement int8 (or even 4bit) by any chance?
Some updates:
- supported bf16
- supported Triton decoupled mode
- verified that LLaMA 65B is working
What are the parameters for kernel autotuning for the LLaMA model?
FasterTransformer doesn't seem to support int4 at all right now. I would be interested in helping with int8, though; that should enable the 65B model to run tensor-parallel on my 2x A6000 GPUs.
+1 happy to contribute to this
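For anyone who wants to pick this up, here is a tiny NumPy sketch of what per-channel weight-only int8 quantization means conceptually; it is only an illustration of the idea, not FasterTransformer's actual kernel path or API:

```python
import numpy as np

def quantize_weight_int8(w: np.ndarray):
    """Symmetric per-output-channel quantization of a [out_dim, in_dim] weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one fp32 scale per output row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(8, 16).astype(np.float32)
q, s = quantize_weight_int8(w)
print("max abs reconstruction error:", np.abs(dequantize(q, s) - w).max())
```

At inference time the int8 weights are rescaled back to fp16 (or used in int8 GEMMs with a rescale), which roughly halves weight memory compared with fp16 and is what should let the 65B model fit across 2x A6000 as mentioned above.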
Hi @byshiue, now that we have made a lot of progress and verified the implementation on many models, would it be possible to get this PR reviewed and merged?
I noticed that llama_example.cpp generates correct outputs in FP16, while triton does not. Does anyone know why?
@michaelroyzen it looks like the root cause is what @yinghai pointed out; merging this PR fixes the issue for decoupled mode.
@byshiue Could this get merged, please?
Conversion works fine, thanks!
Is there any guide on how to compile and run llama_example? A plain make fails for me, so it looks like I need some additional steps.
@void-main Have you compared this with ggml's llama.cpp with cuBLAS support?
Hi. I've recently tested this implementation with blip2_vicuna_instruct. It uses the ViT/Q-Former embedding as a prefix soft embedding, which is fed into Vicuna together with the prompt's token_ids.
According to my test results, when testing vicuna-13b alone, FT outputs text of the same quality as Hugging Face's. However, when token_ids are fed along with the prefix soft embedding, there is a noticeable quality decrease.
For example,
image: (attachment not included)
prompt:
Describe the environment in which the product in the middle of the image is located
pytorch output:
. The product in the middle of this image is located within a refrigerator, surrounded by various fruits and vegetables on both sides as well
FT output:
. The refrigerator is open and filled with food.
The refrigerator is open and filled with food.
Does anyone have experience using FasterTransformer's prefix soft prompt feature? What might cause this issue? Could it be a usage mistake? I need some hints to debug it.
Thanks in advance!
[EDITED]: issue solved
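For anyone else wiring up a soft prompt, here is a conceptual PyTorch sketch of what feeding a prefix soft embedding together with token ids means on the model side. The names and sizes are illustrative only; this is not FasterTransformer's actual API. One easy mistake with soft prompts is forgetting that the real tokens' positions (and hence rotary offsets) are shifted by the prefix length.

```python
import torch
import torch.nn as nn

# Toy sizes; vicuna-13b uses vocab_size=32000 and hidden=5120.
vocab_size, hidden, prefix_len = 1000, 64, 32
embed_tokens = nn.Embedding(vocab_size, hidden)

prefix_embeds = torch.randn(1, prefix_len, hidden)   # e.g. the Q-Former output
token_ids = torch.tensor([[0, 18, 29, 526]])         # prompt token ids (toy values)
token_embeds = embed_tokens(token_ids)

# The decoder consumes the concatenation; the effective sequence length is
# prefix_len + num_tokens, so position ids for the real tokens start at prefix_len.
decoder_inputs = torch.cat([prefix_embeds, token_embeds], dim=1)
print(decoder_inputs.shape)  # torch.Size([1, 36, 64])
```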
@void-main Hi, I'm also in Beijing and I'm an AI inference developer. Could I have your WeChat?
Sure, try sending me an email. :-)
I found that the rotary embedding results differ between FT and Hugging Face. Has anyone run into similar problems?
@void-main Hello, I found a bug: after many (thousands of) batched inferences (batch size 20), some batches may produce random output. But if the Triton service is restarted, inference works normally again.
With batch size 5 I haven't seen it yet.
The prompts mix Chinese and English. Some answer examples:
该��ate-to-p>\n\n\n\n\n\n\n\n\n\n\n\n\nIt's a\n\nIt's a new-est\nIt in the\nIt's at all about\nIt's at the\nIt's at the\nIt's at the\nIt's\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's
该宏在况况冲况��'s 不, 不, 它的 它的 ��, ��\n\n\n\n\n\n\n\n\n\n\n\nJupit was not you be ��, ��\n/b\n/and ��\n/��\n/��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/��\n/\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��
Compiled based on the FasterTransformer backend.
Device: V100/A100, 4 GPUs. Model: Vicuna 13B-v1.1. Parameters: top_k = 1, output_token_len = 500, batch_size = 20.
Another problem: when batch inference is used, the same prompt generates different results (see the sketch after the sample outputs below).
Parameters: top_k=1, random_seed=1, output_len=500. Device: T4/A100, 4 GPUs, via Triton server.
prompt = "写一篇关于爱情的故事"
answer:
['text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以在任何情况下帮助他们克服困难。\n',
'index': 0},
{'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以在任何情况下帮助他们克服困难。\n',
'index': 1},
{'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以战胜一切困难。\n\n这是一个关于�',
'index': 2},
{'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以在任何情况下帮助他们克服困难。\n',
'index': 3},
{'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以战胜一切困难。\n\n这是一个关于�', 'index': 4}]
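I don't have a root cause either, but note that with top_k=1 the sampling itself is deterministic, so the divergence most likely comes from tiny numerical differences in the fp16 logits (e.g. different accumulation orders across batch slots or kernel configurations). A toy illustration (my own, not FT code) of how a sub-resolution difference flips greedy decoding:

```python
import torch

# Two candidate logits that differ by 2e-4 in fp32 but collapse to the same
# value after fp16 rounding; once they tie (or flip), greedy decoding picks a
# different token and the rest of the generation diverges.
logits = torch.tensor([10.1234, 10.1236, 3.0], dtype=torch.float32)
print(torch.argmax(logits))                           # tensor(1)
print(torch.argmax(logits.half().float()))            # tensor(0): both round to 10.125
print(torch.argmax(logits + 3e-4 * torch.randn(3)))   # may be 0 or 1, run to run
```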
It seems that torch.cos() and the C cos function generate slightly different results, which leads to different rotary embedding results. Does anyone have an idea for a solution?
You are right; the model should use the basic rotary embedding type.
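For reference, a bit-exact match between a C++ rotary kernel and the Hugging Face torch implementation is probably not achievable anyway; the transcendental functions themselves already differ in the last bits. A quick check (my own sketch, assuming the usual base-10000 rotary formulation) comparing torch.cos in fp32 against a float64 math.cos reference:

```python
import math
import torch

dim, base, pos = 128, 10000.0, 100
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
angles = pos * inv_freq

cos_torch = torch.cos(angles)
cos_ref = torch.tensor([math.cos(a) for a in angles.tolist()], dtype=torch.float32)
print((cos_torch - cos_ref).abs().max())   # tiny, typically around 1e-7 or below
```

So a visible output mismatch is more likely to come from using a different rotary variant (interleaved pairs vs. split halves), which matches the "basic type of rotary" comment above, than from cos itself.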
First of all, thanks for your FT LLaMA implementation, @void-main. I pushed a PR to support int8 and shared context. Can anyone help me check it?
Hi @CN-COTER, thanks for the contribution, really appreciate it! I've checked your code and started a review; could you please take a look? 🎉
My start_ids.csv: [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973]
out:
0 18637 29892 526 366 1136 455 2470 29973 1815 366 5193 304 592 29973 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 2 2991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991
decode out: Hey, are you consciours? Can you talk to me?olgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolg
It looks like there's a problem with the output ids and the decoded output. Please give me some suggestions.
Same issue here (random outputs after many batched inferences). Have you found a solution?
@void-main Please give some suggestions on the repeated-token output issue above. Thank you!
@void-main Hello,i found a bug that after multiple (thousands of) batch(20) inference, some batches may output randomly. But if the triton service is restarted, it can be inferred normally. When the batch size is equal 5, I haven't found it yet. Prompt Mixed Chinese and English Some Answer Example:
该��ate-to-p>\n\n\n\n\n\n\n\n\n\n\n\n\nIt's a\n\nIt's a new-est\nIt in the\nIt's at all about\nIt's at the\nIt's at the\nIt's at the\nIt's\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's 该宏在况况冲况��'s 不, 不, 它的 它的 ��, ��\n\n\n\n\n\n\n\n\n\n\n\nJupit was not you be ��, ��\n/b\n/and ��\n/��\n/��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/��\n/\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��
Compile based on Fastertransformer Backend Device: V100/ A100 4gpu Model: Vicuna 13B-v1.1 parameters: top_k = 1, output_token_len = 500, batch_size = 20
Same issue here with random outputs after many batched inferences. Have you found a solution?
+1, but I haven't found reproduction steps yet.
Will the Llama 2 70B architecture be supported in the future? @void-main Thanks