Qwen Qwen-14B-Chat微调后模型 + fastchat 0.2.29 在2x4090上推理速度比其他13B模型慢很多

是否已有关于该错误的issue或讨论？ | Is there an existing issue / discussion for this?

[X] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions

该问题是否在FAQ中有解答？ | Is there an existing answer for this in FAQ?

[X] 我已经搜索过FAQ | I have searched FAQ

当前行为 | Current Behavior

期望行为 | Expected Behavior

无

复现方法 | Steps To Reproduce

安装fastchat 0.2.29
基于Qwen-14B-Chat微调模型
使用openai接口测试

运行环境 | Environment

- OS:Ubuntu 22.04
- Python:3.10.9
- Transformers:4.32.0
- fastchat:0.2.29
- PyTorch:2.0.1+cu118
- flash-attn ：2.2.5
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):11.8

备注 | Anything else?

无

Sep 27 '23 07:09 baibaiw5

我感觉现在新7B也慢了不少吧？

Sep 27 '23 08:09 gumanchang

做个调查，你们在训练或推理时有开启flash-attention吗？如果开启了那相较于旧版代码应该更快才对，因为v1.1里flash-attention的计算是去除了padding的。另外，在不开flash的情况下v1.1的代码相较于之前确实有可能慢一些，因为我们在计算softmax时先把attn_weights转成了fp32，这样可以减少精度损失，softmax

Sep 27 '23 12:09 logicwong

我没有开启flash attention ，那现在还能恢复之前的速度不，或者有参数调整

---- 回复的原邮件 ---- | 发件人 | Wang @.> | | 发送日期 | 2023年09月27日 20:06 | | 收件人 | QwenLM/Qwen @.> | | 抄送人 | Gu @.>, Comment @.> | | 主题 | Re: [QwenLM/Qwen] Qwen-14B-Chat微调后模型 + fastchat 0.2.29 在2x4090上推理速度比其他13B模型慢很多 (Issue #372) |

做个调查，你们在训练或推理时有开启flash-attention吗？如果开启了那相较于旧版代码应该更快才对，因为v1.1里flash-attention的计算是去除了padding的。另外，在不开flash的情况下v1.1的代码相较于之前确实有可能慢一些，因为我们在计算softmax时先把attn_weights转成了fp32，这样可以减少精度损失，softmax

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

Sep 27 '23 12:09 gumanchang

做个调查，你们在训练或推理时有开启flash-attention吗？如果开启了那相较于旧版代码应该更快才对，因为v1.1里flash-attention的计算是去除了padding的。另外，在不开flash的情况下v1.1的代码相较于之前确实有可能慢一些，因为我们在计算softmax时先把attn_weights转成了fp32，这样可以减少精度损失，softmax

训练到时候开启了flash-attention2 + csrc/layer_norm+ csrc/rotary，配置文件："use_flash_attn": auto,训练速度相比不开启flash-attension是快了一些。推理的时候安装了flash-attention2 + csrc/layer_norm+ csrc/rotary，落地模型的时候config.json中的"use_flash_attn": false，相比Llama13B和Baichuan-13B慢了好多. 我后面把它设置为true对比看下。

Sep 27 '23 14:09 baibaiw5

做个调查，你们在训练或推理时有开启flash-attention吗？如果开启了那相较于旧版代码应该更快才对，因为v1.1里flash-attention的计算是去除了padding的。另外，在不开flash的情况下v1.1的代码相较于之前确实有可能慢一些，因为我们在计算softmax时先把attn_weights转成了fp32，这样可以减少精度损失，softmax

训练到时候开启了flash-attention2 + csrc/layer_norm+ csrc/rotary，配置文件："use_flash_attn": auto,训练速度相比不开启flash-attension是快了一些。推理的时候安装了flash-attention2 + csrc/layer_norm+ csrc/rotary，落地模型的时候config.json中的"use_flash_attn": false，相比Llama13B和Baichuan-13B慢了好多. 我后面把它设置为true对比看下。

开启了flash_attn推理后，原本问“你是谁”需要38s,目前只需要7s左右

Sep 28 '23 01:09 baibaiw5

做个调查，你们在训练或推理时有开启flash-attention吗？如果开启了那相较于旧版代码应该更快才对，因为v1.1里flash-attention的计算是去除了padding的。另外，在不开flash的情况下v1.1的代码相较于之前确实有可能慢一些，因为我们在计算softmax时先把attn_weights转成了fp32，这样可以减少精度损失，softmax

Then问题主要还是在flash attn上？

Sep 28 '23 02:09 JustinLin610

@JustinLin610 问题可能在softmax_in_fp32上。 @baibaiw5 了解了，softmax_in_fp32这个操作我加个超参，然后默认关闭吧，这样不开flash时应该能快一些

Sep 28 '23 03:09 logicwong

做个调查，你们在训练或推理时有开启flash-attention吗？如果开启了那相较于旧版代码应该更快才对，因为v1.1里flash-attention的计算是去除了padding的。另外，在不开flash的情况下v1.1的代码相较于之前确实有可能慢一些，因为我们在计算softmax时先把attn_weights转成了fp32，这样可以减少精度损失，softmax

我这GPU还是V100, 没办法安装flash attn加速

Sep 28 '23 03:09 gumanchang

我之前部署了Vicuna-13B模型，生成速度非常快，基本跟ChatGPT的速度差不多。但是在相同服务器上部署Qwen-14B，生成速度降到了大概2秒3个单词。慢了很多。

Sep 28 '23 03:09 thiner

@logicwong 想问同样是在V100 *N 上做微调，这样还能在推理时转A100开flash attention吗? 结果是有效的吗？还是会建议微调跟推理时都用一样设定比较好？

Sep 28 '23 04:09 ChaoChungWu-Johnson

@ChaoChungWu-Johnson 可以的。但是使用转A100推理时记得手动开启fp16精度，避免切换成bf16带来的精度损失

Sep 28 '23 07:09 logicwong

@logicwong 那现在V100显卡上推理是等你们添加了参数在测试性能吗？

Sep 28 '23 13:09 gumanchang

@gumanchang hugginface的代码已经同步了哈，可以测测看。modelscope的代码待同事操作。

PS：内网联不上hf，之前ms自动同步hf的脚本失效了，现在都需要手动同步

Sep 28 '23 14:09 logicwong

@logicwong 我尝试了最新代码的7B-v1.1,感觉速度还是没有之前的版本快了，肉眼可见的比不上之前版本7B.

Oct 01 '23 04:10 gumanchang

我用两张v100来运行demo做推理，感觉还是很慢

Oct 08 '23 07:10 rayqi36

2张V100 32G，一次对话需要30秒左右，未开启flash-attention

Oct 11 '23 06:10 cnsky2016

@gumanchang hugginface的代码已经同步了哈，可以测测看。modelscope的代码待同事操作。

PS：内网联不上hf，之前ms自动同步hf的脚本失效了，现在都需要手动同步

今天刚拉了hf ，感觉速度还是很慢

Oct 12 '23 08:10 AlexasXu

我们对代码进行了速度优化，速度相较于之前提升了30%以上（w & w/o flash attention），hf已经更新，modelscope晚点也会同步上去。大家可以更新到最新代码试下（推荐使用torch 2.0以上的版本进行测试）

Oct 14 '23 13:10 logicwong

我们对代码进行了速度优化，速度相较于之前提升了30%以上（w & w/o flash attention），hf已经更新，modelscope晚点也会同步上去。大家可以更新到最新代码试下（推荐使用torch 2.0以上的版本进行测试）

已同步最新HF文件和代码，在Linux、V100 GPU、CUDA=11.7、pytorch=2.0.1、python=3.10、Transformers=4.33.1环境下，调用model.chat_stream，单卡性能7.8汉字/s；在多卡下性能2.2汉字/s，请问这是什么原因？

Oct 14 '23 16:10 DanteYuan

我们对代码进行了速度优化，速度相较于之前提升了30%以上（w & w/o flash attention），hf已经更新，modelscope晚点也会同步上去。大家可以更新到最新代码试下（推荐使用torch 2.0以上的版本进行测试）

已同步最新HF文件和代码，在Linux、V100 GPU、CUDA=11.7、pytorch=2.0.1、python=3.10、Transformers=4.33.1环境下，调用model.chat_stream，单卡性能7.8汉字/s；在多卡下性能2.2汉字/s，请问这是什么原因？

可以分享下代码吗？我们排查一下问题

Oct 14 '23 16:10 logicwong

我们对代码进行了速度优化，速度相较于之前提升了30%以上（w & w/o flash attention），hf已经更新，modelscope晚点也会同步上去。大家可以更新到最新代码试下（推荐使用torch 2.0以上的版本进行测试）

已同步最新HF文件和代码，在Linux、V100 GPU、CUDA=11.7、pytorch=2.0.1、python=3.10、Transformers=4.33.1环境下，调用model.chat_stream，单卡性能7.8汉字/s；在多卡下性能2.2汉字/s，请问这是什么原因？

可以分享下代码吗？我们排查一下问题

# 配置模型路径
model_path = '/home/work/models/Qwen-14B-Chat'

# 加载模型
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    local_files_only=True,
    trust_remote_code=True
).eval()

# 问题
prompt = '你好'

# 打印返回
for resp in model.chat_stream(tokenizer, query=prompt, history=[]):
    print(resp)

Oct 14 '23 16:10 DanteYuan

我们对代码进行了速度优化，速度相较于之前提升了30%以上（w & w/o flash attention），hf已经更新，modelscope晚点也会同步上去。大家可以更新到最新代码试下（推荐使用torch 2.0以上的版本进行测试）

已同步最新HF文件和代码，在Linux、V100 GPU、CUDA=11.7、pytorch=2.0.1、python=3.10、Transformers=4.33.1环境下，调用model.chat_stream，单卡性能7.8汉字/s；在多卡下性能2.2汉字/s，请问这是什么原因？

可以分享下代码吗？我们排查一下问题

配置模型路径

model_path = '/home/work/models/Qwen-14B-Chat'

加载模型

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained( model_path, device_map="auto", local_files_only=True, trust_remote_code=True ).eval()

问题

prompt = '你好'

打印返回

for resp in model.chat_stream(tokenizer, query=prompt, history=[]): print(resp)

我明天找台V100测试

Oct 14 '23 17:10 JustinLin610

请问单卡4090推理速度 7 tokens/s，这是正常速度吗同一张卡上，qwen-7b-chat是60+tokens/s 这个性能差别和这里列举的7B-chat与14B-chat-int4之间的差别相差挺大的

还有就是在安装flash-attention之后推理速度并没有什么变化

环境： python3.8+torch2.0.0+cuda11.8+transformers4.36+flash-attn2.4.0
硬件：10700K+28G+4090/24G
代码：

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
from peft import AutoPeftModelForCausalLM
from tqdm import tqdm
import json
import time

model_path = "/localmodel/qwen/Qwen-14B-Chat-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained(model_path, trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参



with open("mydata.json", "r") as f:
    data = json.load(f)

leng = 0
t1 = time.time()
for conv in tqdm(data):
    response, _ = model.chat(tokenizer, conv["conversations"][0]["value"], history=None)
    print(leng, leng/(time.time()-t1))

Dec 22 '23 12:12 SilentMoebuta

@SilentMoebuta 量化模型速度特慢的话，一般是autogptq没有预编译对应的cuda kernel，需要找个匹配的版本，或者自己从source安装。

Jan 02 '24 11:01 jklj077

Qwen Qwen copied to clipboard

Qwen-14B-Chat微调后模型 + fastchat 0.2.29 在2x4090上推理速度比其他13B模型慢很多

是否已有关于该错误的issue或讨论？ | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答？ | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

期望行为 | Expected Behavior

复现方法 | Steps To Reproduce

运行环境 | Environment

备注 | Anything else?

配置模型路径

加载模型

问题

打印返回

Qwen
Qwen copied to clipboard