InternVL produces abnormal output when deployed with LMDeploy at tp>1

Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.
[X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

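[Problem] When deploying with lmdeploy:

With tp=1 (no tensor parallelism), the model output is normal: answers are highly accurate and the image is understood correctly.
With tp>1, regardless of model size (8B, 26B, 40B, and so on), text-only conversation works fine, but for questions about image content the output is almost always unrelated to the input image, and occasionally the output repeats indefinitely.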

Reproduction

I followed the launch command from the official site exactly, only appending --tp to enable parallelism:

lmdeploy serve api_server OpenGVLab/InternVL2-40B --backend turbomind --server-port 23333 --chat-template chat_template.json --tp 4

Sending the following request then produces abnormal output (with tp=1 it works normally, provided a single GPU has enough memory):

{
    "model": "OpenGVLab/InternVL2-26B",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "给这个图片起个标题"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://xxxx"
                    }
                }
            ]
        }
    ],
    "temperature": 0.001
}
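For anyone who wants to replay this request while switching between tp=1 and tp>1, here is a minimal Python sketch. It assumes the server from the command above is reachable at localhost:23333 and serves the OpenAI-compatible /v1/chat/completions route; `BASE_URL`, `build_payload`, and `ask` are illustrative names, not part of lmdeploy.

```python
import json
import urllib.request

# Assumption: the api_server listens on localhost:23333 and exposes the
# OpenAI-compatible /v1/chat/completions route.
BASE_URL = "http://localhost:23333"


def build_payload(model, prompt, image_url, temperature=0.001):
    """Build a chat request body with the same shape as the JSON above."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "temperature": temperature,
    }


def ask(payload, base_url=BASE_URL, timeout=120):
    """POST the payload and return the text of the first choice."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (needs a running server and a real image URL in place of the
# elided one from the report):
#   payload = build_payload("OpenGVLab/InternVL2-26B", "给这个图片起个标题", "http://xxxx")
#   print(ask(payload))
```

Running the same payload once against a tp=1 server and once against a tp>1 server should make the difference in the answers easy to compare side by side.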

Environment

[Basic environment]
Nvidia A6000
CUDA 12.0, driver 530.41.03
Python 3.12, lmdeploy 0.5.2, PyTorch 2.2.2

Error traceback

No response

Aug 02 '24 02:08 tankgit

I ran into the same problem. Could it be that you don't have flash attention installed?

Aug 02 '24 10:08 czczup

In my testing, tp works correctly as long as flash attention is installed. I'm not yet sure what the underlying cause is.

Aug 02 '24 10:08 czczup

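> In my testing, tp works correctly as long as flash attention is installed. I'm not yet sure what the underlying cause is.

Indeed, I did not have flash attn installed before. I tried pip install flash-attn (version 2.6.3), and the flash attn warning no longer appears when starting the server, but the problem I described above is still there 😥 with no change at all.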

Aug 05 '24 06:08 tankgit

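Hi, could you try upgrading lmdeploy to 0.5.3, and uninstall apex if you have it installed? I have recently been labeling data with lmdeploy 0.5.3 + InternVL2-Llama3-76B + tp=8, and I get normal inference results, with a 10x speedup over PyTorch (transformers) inference.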

Sep 06 '24 14:09 czczup
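
The replies above come down to which versions of lmdeploy, flash-attn, and apex are installed. A small sketch for checking that in the serving environment follows. The distribution names in `PACKAGES` are assumptions about how pip registers them locally, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Assumed distribution names; adjust if pip lists them differently.
PACKAGES = ["lmdeploy", "torch", "flash-attn", "apex"]


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Following the suggestion in the last reply, the state to aim for would be lmdeploy at 0.5.3 and apex reported as not installed.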