[Bug] MiniCPM-V-2.6 inference results from HF and lmdeploy are inconsistent
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
The inference results of MiniCPM-V-2.6 under HF and lmdeploy are inconsistent; they still differ even when reproducing with top_k = 1.
Reproduction steps
Prepare the data:
wget "https://support.huaweicloud.com/api-ocr/zh-cn_image_0000001698774808.png"
This downloads zh-cn_image_0000001698774808.png into the current directory.
LMDeploy reproduction
- Use the image openmmlab/lmdeploy:v0.6.0a0-cu12
- Run the following code:
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(max_batch_size=1, cache_max_entry_count=0.4)
pipe = pipeline('/data/models/openbmb/MiniCPM-V-2_6/', log_level='INFO', backend_config=backend_config)

image_path = "zh-cn_image_0000001698774808.png"
prompt = "请详细识别图中的内容并以 markdown 格式返回"
messages = [
    dict(role='user', content=[
        dict(type='text', text=prompt),
        dict(type='image_url', image_url=dict(url=image_path)),
    ])
]
gen_config = GenerationConfig(top_p=1, top_k=1, temperature=0.1, repetition_penalty=1.05, max_new_tokens=4096)
out = pipe(messages, gen_config=gen_config)
print(out.text)
Model output:
这张图片展示了一份门诊检验报告单,具体内容如下:
**标题:**
门诊检验报告单
**副标题:**
血常规(5分类)
**状态说明:**
标本状态:正常
**临床诊断:**
1. 慢性扁桃体炎
**检验项目列表及结果:**
- 中性细胞百分率 (NEL%):77.1%
- 参考范围:40-75%
- 淋巴细胞百分率 (LYM%):8.8%
- 参考范围:20-50%
- 单核细胞百分率 (MONO%):7.1%
- 参考范围:3.0-10.0%
- 红细胞计数 (RBC):6.66
- 参考范围:4.3-5.8%
**签名区域:**
送检医生:
检验者:
审核者:
这份报告详细列出了患者的血常规检查结果,并根据参考范围对各项指标进行了评估。
HF reproduction
Code:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('/data/models/openbmb/MiniCPM-V-2_6/', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('/data/models/openbmb/MiniCPM-V-2_6/', trust_remote_code=True)

def chat_llm(image_path, prompt):
    image = Image.open(image_path).convert('RGB')
    message = [{'role': 'user', 'content': [image, prompt]}]
    res = model.chat(
        image=None,
        msgs=message,
        tokenizer=tokenizer,
        temperature=0.1,
        top_p=1,
        top_k=1,
        do_sample=True,
        repetition_penalty=1.05,
    )
    print(res)
    print("==============")

prompt = "请详细识别图中的内容并以 markdown 格式返回"
chat_llm("zh-cn_image_0000001698774808.png", prompt)
Model output:
这张图片展示了一份门诊检验报告单,具体内容如下:
**标题:门诊检验报告单**
**副标题:血常规(5分类)**
**标本状态:正常**
**临床诊断:1.慢性扁桃体炎**
| 检验项目 | 结果 | 参考范围 | 单位 |
|---------|------|----------|------|
| 中性细胞百分率 (NEL%) | 77.1 | 40-75 | % |
| 淋巴细胞百分率 (LYM%) | 8.8 | 20-50 | % |
| 单核细胞百分率 (MONO%) | 7.1 | 3.0-10.0 | % |
| 红细胞计数 (RBC) | 6.66 | 4.3-5.8 | % |
**送检医生:**
[空白]
**检验者:**
[空白]
**审核者:**
[空白]
Reproduction
As shown above.
Environment
As shown above.
Error traceback
No response
Comparison
LMDeploy log:
prompt='<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image_id>0</image_id><image><IMAGE_TOKEN></image><slice><IMAGE_TOKEN></slice><slice><IMAGE_TOKEN></slice>\n<slice><IMAGE_TOKEN></slice><slice><IMAGE_TOKEN></slice>\n请详细识别图中的内容并以 markdown 格式返回<|im_end|>\n<|im_start|>assistant\n',
gen_config=EngineGenerationConfig(n=1, max_new_tokens=8192, top_p=1.0, top_k=1, temperature=0.1, repetition_penalty=1.05, ignore_eos=False, random_seed=4988746044838101047, stop_words=[151645], bad_words=None, min_new_tokens=None, skip_special_tokens=True, logprobs=None, logits_processors=None),
prompt_token_id=[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151658, 15, 151659, 151646, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 151647, 151656, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 151657, 151656, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 151657, 198, 151656, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 151657, 151656, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 151657, 198, 14880, 100700, 102450, 28029, 101047, 43815, 62926, 23031, 50494, 51461, 120, 28330, 31526, 151645, 198, 151644, 77091, 198]
HF debug info:
input_ids: tensor([[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13,
151645, 198, 151644, 872, 198, 151658, 15, 151659, 151646,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 151647, 151656, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 151657, 151656, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 151657, 198,
151656, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 151657, 151656, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244, 128244,
128244, 128244, 128244, 128244, 128244, 151657, 198, 14880, 100700,
102450, 28029, 101047, 43815, 62926, 23031, 50494, 51461, 120,
28330, 31526, 151645, 198, 151644, 77091, 198]],
device='cuda:0', dtype=torch.int32)
Comparison
Apart from the ID 128244, all the other IDs are identical.
This model is supported in lmdeploy v0.6.0a0. You may upgrade to the latest version.
I just tested with the image openmmlab/lmdeploy:v0.6.0a0-cu12, and the problem still exists.
We use 0 as the placeholder for the image embedding.
@zhjunqin
Do you expect the results to be exactly the same? That is unlikely. When we added support for MiniCPM-V-2.6, I fed the input embeddings obtained from HF into lmdeploy, and even with sampling disabled the results still differed; we attribute this to numerical differences between the kernels.
Besides the differing kernel implementations, lmdeploy runs the vision part in float16, while the HF reproduction you posted uses bfloat16, which also introduces some difference. But as noted above, even with identical inputs (embeddings), the two sides' results will still diverge somewhat.
You compared the input_ids: 0 and 128244 are the two sides' image placeholders, and both are replaced by image features at the embedding stage, so if that is the only difference, the input_ids are aligned.
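For reference, a minimal sketch (my own illustration, not code from either library) of checking that the two prompts agree once the image placeholders are normalized:

```python
# Verify that the lmdeploy prompt_token_id and the HF input_ids differ only in
# the image-placeholder slots (0 for lmdeploy, 128244 for HF), which are
# replaced by image features at the embedding stage anyway.
LMDEPLOY_IMAGE_PLACEHOLDER = 0
HF_IMAGE_PLACEHOLDER = 128244

def normalize(ids, placeholder, sentinel=-1):
    return [sentinel if t == placeholder else t for t in ids]

def prompts_aligned(lmdeploy_ids, hf_ids):
    return (normalize(lmdeploy_ids, LMDEPLOY_IMAGE_PLACEHOLDER)
            == normalize(hf_ids, HF_IMAGE_PLACEHOLDER))
```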
Yes, I expect identical, deterministic results.
Right, what I wanted to show by comparing the input_ids is exactly that the input_ids on both sides are consistent.
From your analysis, the two sides simply cannot be aligned, is that right? I also noticed that the longer the generated text, the larger the divergence becomes toward the end.
@zhjunqin
Exactly. Under greedy decoding the outputs usually stay consistent at the beginning, but once any position differs, the input from that point on is no longer the same, so the chance that the rest still matches becomes even smaller.
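To quantify that effect, here is a small helper (an illustration of mine, not part of either codebase) that finds the first position where two greedy generations diverge:

```python
def first_divergence(tokens_a, tokens_b):
    """Return the index of the first differing token; after this point the two
    runs are conditioned on different prefixes, so they rarely agree again."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    return min(len(tokens_a), len(tokens_b))
```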
> lmdeploy runs the vision part in float16, while the HF reproduction you posted uses bfloat16, which also introduces some difference. But as noted above, even with identical inputs (embeddings), the two sides' results will still diverge somewhat.

@irexyc Hi, may I ask what data type the LLM part of lmdeploy uses?
@peace-zy
It follows the torch_dtype in config.json.
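As a quick sanity check (reusing the checkpoint path from the reproduction above), the dtype the engine will follow can be read straight from the checkpoint's config.json:

```python
import json

# The engine follows torch_dtype from the HF checkpoint's config.json.
with open('/data/models/openbmb/MiniCPM-V-2_6/config.json') as f:
    print(json.load(f).get('torch_dtype'))  # e.g. 'bfloat16' or 'float16'
```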
@irexyc Got it, thanks. Is this the relevant code? https://github.com/InternLM/lmdeploy/blob/4e5cc16682bf6a413acff493874c90c91255f8bc/lmdeploy/turbomind/deploy/target_model/base.py#L29
@peace-zy
The automatic dtype detection is in this section; currently only fp16 and bf16 can be run:
https://github.com/InternLM/lmdeploy/blob/4e5cc16682bf6a413acff493874c90c91255f8bc/lmdeploy/turbomind/deploy/converter.py#L101-L148
Alternatively, you can manually specify the engine dtype: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/messages.py
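For example, a sketch of pinning the engine dtype through TurbomindEngineConfig, assuming the installed lmdeploy version already exposes the dtype field referenced in messages.py above:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# dtype='auto' follows config.json; it can also be pinned explicitly
# (assumed field name; check your lmdeploy version's messages.py).
backend_config = TurbomindEngineConfig(
    max_batch_size=1,
    cache_max_entry_count=0.4,
    dtype='bfloat16',  # or 'float16'
)
pipe = pipeline('/data/models/openbmb/MiniCPM-V-2_6/', backend_config=backend_config)
```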
@irexyc Got it, thank you.
@irexyc Why is the vision part forced to float16? Does bf16 cause problems? I also ran into a strange phenomenon with the InternVL2 8B model: on an H100 and an A100 80G, with the same code, the same Docker image, and temperature set to 0,
1. the H100 result without flash attention only matches the A100 result with flash attention enabled;
2. with flash attention enabled on both GPUs the results differ, and with it disabled on both they also differ.
What could be the reason?
Because the engine side used to accept inputs as numpy arrays, and numpy has no bf16 dtype; also, not every GPU supports bf16, so for simplicity the vision part was forced to fp16.
For the results to match exactly, the hardware must be identical and the kernel implementations must be the same; only then is there no error at all. There is no need to worry about whether a single sample matches exactly. If you are concerned, you can evaluate accuracy with opencompass; if that aligns, everything is fine.
@irexyc Thanks a lot.
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.