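Qwen2.5 Quick question for the experts: does Qwen1.5 not use flash-attention to accelerate inference? If so, why does the model run at different speeds on two servers with the same hardware and installed packages? The token counts are about the same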

Open annian101 opened this issue 11 months ago • 3 comments

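Quick question for the experts: does Qwen1.5 not use flash-attention to accelerate inference? If so, why does the model run at different speeds on two servers with the same hardware and installed packages? The token counts are about the same.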

Mar 07 '24 03:03 annian101

transformers will infer the attention implementation from your actual environment (there are three implementations: PyTorch manual, PyTorch SDPA, and flash-attention v2).

Mar 11 '24 13:03 jklj077
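Since the attention implementation depends on the environment, one way to compare two servers is to print the relevant package versions on each and see which backends are even candidates. The following is a minimal sketch, not transformers' own selection logic: the helper names (`collect_versions`, `major`, `candidate_backends`) are made up for illustration, and the version thresholds are assumptions (PyTorch 2.x for SDPA, flash-attn 2.x for flash-attention v2). transformers may apply stricter checks on minor versions, GPU type, and dtype.

```python
from importlib import metadata


def collect_versions():
    """Return installed versions (or None) of the packages that matter here."""
    versions = {}
    for dist in ("transformers", "torch", "flash-attn"):
        try:
            versions[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            versions[dist] = None
    return versions


def major(version):
    """Leading integer of a version string such as '2.1.2+cu121'; None if absent."""
    if not version:
        return None
    head = version.split(".")[0]
    return int(head) if head.isdigit() else None


def candidate_backends(versions):
    """List attention backends that could be usable, under the stated assumptions."""
    backends = ["eager"]  # the plain PyTorch implementation needs nothing extra
    if (major(versions.get("torch")) or 0) >= 2:
        backends.append("sdpa")  # assumption: any PyTorch 2.x is new enough
    if (major(versions.get("flash-attn")) or 0) >= 2:
        backends.append("flash_attention_2")  # assumption: flash-attn 2.x installed
    return backends


if __name__ == "__main__":
    found = collect_versions()
    print(found)
    print(candidate_backends(found))
```

Running this on both servers and diffing the output should show whether they really have the same packages. To rule out automatic selection as the cause, the backend can also be pinned when loading the model through the `attn_implementation` argument of `from_pretrained` (for example `"eager"`, `"sdpa"`, or `"flash_attention_2"`), keeping in mind that flash-attention v2 generally needs a supported GPU and fp16 or bf16 weights.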

It's that transformers 4.37.0 and above no longer supports flash-attn. I ran my experiments with Qwen1.

Mar 13 '24 07:03 hhk123

It's that transformers 4.37.0 and above no longer supports flash-attn. I ran my experiments with Qwen1.

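Does that mean Qwen1.5 can't use flash-attn at all? Doesn't Qwen1.5 require transformers 4.37.0 or above? I also tried flash-attn yesterday but didn't notice any difference, whereas chatglm3-6b-32k sped up noticeably.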

Mar 14 '24 09:03 Iheadx
