Help needed: interface for the lmdeploy turbomind attention operator

Checklist

I have searched related issues but cannot get the expected help.

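I read on Zhihu that lmdeploy's turbomind attention is twice as fast as flash attention. I happen to have a lot of V100s on hand, so I wanted to check whether turbomind attention also outperforms xformers on V100, but I cannot find any interface for calling turbomind attention directly.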

May 14 '25 12:05 Lubenwei-nb123

turbomind attention has no Python interface; it is written in C++ and CUDA.

May 14 '25 14:05 lvhan028

> turbomind attention has no Python interface; it is written in C++ and CUDA.

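Thanks for the reply. I had noticed that you don't use pybind to bind the operator interface. Is there a convenient way to benchmark turbomind attention on V100 across different input sizes? I looked through /src/turbomind/kernel/attention but could not really follow the logic.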
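For illustration, the kind of shape sweep described here can be sketched independently of the backend. The snippet below is only a minimal sketch using the standard library: `reference_attention`, `make_inputs`, and `benchmark` are hypothetical names, and the pure-Python attention is a stand-in for whatever callable is actually under test (an xformers call, or a custom binding around the turbomind kernel, neither of which is shown here).

```python
import math
import statistics
import time


def reference_attention(q, k, v):
    """Naive single-head attention: softmax(q @ k^T / sqrt(d)) @ v.

    Hypothetical pure-Python placeholder for the kernel under test.
    q is [seq_q][d]; k and v are [seq_k][d]; all are lists of floats.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(d)
        ])
    return out


def make_inputs(seq_len, head_dim):
    """Build deterministic dummy q, k, v matrices of shape [seq_len][head_dim]."""
    mat = [[((i + c) % 7) / 7.0 for c in range(head_dim)] for i in range(seq_len)]
    return mat, mat, mat


def benchmark(fn, shapes, warmup=1, iters=3):
    """Return {(seq_len, head_dim): median seconds per call} for each shape."""
    results = {}
    for seq_len, head_dim in shapes:
        q, k, v = make_inputs(seq_len, head_dim)
        for _ in range(warmup):  # warm-up runs are not timed
            fn(q, k, v)
        samples = []
        for _ in range(iters):
            start = time.perf_counter()
            fn(q, k, v)
            samples.append(time.perf_counter() - start)
        results[(seq_len, head_dim)] = statistics.median(samples)
    return results


if __name__ == "__main__":
    shapes = [(16, 8), (32, 8), (64, 8)]
    for shape, sec in benchmark(reference_attention, shapes).items():
        print(shape, f"{sec * 1e3:.3f} ms")
```

With a real GPU kernel, the device would need to be synchronized before each clock read (for example with `torch.cuda.synchronize()` when the callable runs on PyTorch tensors), otherwise only the kernel launch is timed.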

May 15 '25 04:05 Lubenwei-nb123

@lzhangzz is there any guide?

May 15 '25 05:05 lvhan028

Actually, I don't quite understand: if turbomind attention is this fast, why does internvl3 still call flash attention? Why not use turbomind attention everywhere, which would also improve speed?

May 17 '25 08:05 bltcn

Because turbomind only supports the LLM part. The vision encoder reuses the transformers implementation.

May 19 '25 14:05 lvhan028