
Does TensorRT-LLM support passing input_embeds directly?

Open Hukongtao opened this issue 1 year ago • 11 comments

For multimodal models, we usually need to combine the visual features with the text embeddings into a final input_embeds tensor and send it to the model for inference. Currently, this combination can differ between multimodal models. For example, for InternVL2: https://huggingface.co/OpenGVLab/InternVL2-26B/blob/72496452c5525ba579fdd87d62bb958bfa59020e/modeling_internvl_chat.py#L318-L334 Therefore, can TensorRT-LLM support directly passing the final input_embeds? Although TensorRT-LLM provides the prompt_table parameter, in some cases prompt_table cannot meet our needs.
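For context, HF-style multimodal models typically perform this merge themselves before the LLM forward pass. A minimal sketch of that pattern (illustrative names only, not a TensorRT-LLM API):

```python
import torch

def merge_visual_embeds(input_ids, text_embeds, visual_embeds, image_token_id):
    """Replace the placeholder rows of text_embeds with visual features.

    input_ids:     (batch, seq_len) token ids containing <image> placeholder ids
    text_embeds:   (batch, seq_len, hidden) output of the word embedding layer
    visual_embeds: (num_image_tokens, hidden) flattened vision-encoder features,
                   one row per placeholder position, in order
    """
    input_embeds = text_embeds.clone()
    mask = input_ids == image_token_id              # (batch, seq_len) bool
    # Scatter the visual features into the placeholder positions.
    input_embeds[mask] = visual_embeds.to(input_embeds.dtype)
    return input_embeds
```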

Hukongtao avatar Aug 09 '24 08:08 Hukongtao

same question

DefTruth avatar Aug 12 '24 05:08 DefTruth

same question

Wow, it's you! I even follow you on Zhihu.

Hukongtao avatar Aug 12 '24 07:08 Hukongtao

I'm also curious how input_embeds could be passed directly; I'm not sure whether your specific need for passing input_embeds directly is the same as mine. That said, InternVL2 can be run with trt-llm by building the prompt as pre + img + post. The token ids are determined before they go into trt-llm; when they are actually fed to the trt-llm decoder engine, they are passed in together with the image's visual features. Inside the decoder engine, the input_ids are embedded and then concatenated with the visual features, so this can be implemented.
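A rough sketch of this pre + img + post idea, assuming the usual TensorRT-LLM prompt-tuning convention where token ids at or above vocab_size are looked up in the prompt table instead of the word-embedding table (names and shapes here are assumptions, not the exact API):

```python
import torch

def build_ptuning_inputs(tokenizer, pre_text, post_text, visual_embeds, vocab_size):
    """Build input_ids where the image span is a run of out-of-vocabulary
    'virtual' ids, plus the prompt table holding the visual features.

    visual_embeds: (num_image_tokens, hidden) features from the vision encoder
    """
    pre_ids = tokenizer(pre_text, add_special_tokens=False).input_ids
    post_ids = tokenizer(post_text, add_special_tokens=False).input_ids

    num_img = visual_embeds.shape[0]
    # Ids >= vocab_size index into the prompt table inside the decoder engine.
    img_ids = list(range(vocab_size, vocab_size + num_img))

    input_ids = torch.tensor([pre_ids + img_ids + post_ids], dtype=torch.int32)
    prompt_table = visual_embeds  # passed to the decoder engine alongside input_ids
    return input_ids, prompt_table
```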

Oldpan avatar Aug 12 '24 14:08 Oldpan

Yes, that is the only implementation available at the moment. But this approach still has a problem when passing input embeddings: when you need to use a penalty such as repetition_penalty, transformers only considers the output ids for the penalty, whereas trtllm runs inference from input ids and effectively applies the penalty over input ids + output ids. As a result, with the penalty enabled on both sides, the outputs cannot be aligned with trn's.
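To illustrate the mismatch being described, here is a minimal sketch of a standard repetition penalty; the only difference between the two setups is which token ids end up in seen_token_ids (generated ids only vs. prompt ids + generated ids):

```python
import torch

def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Penalize the logits of every token id that already appears in seen_token_ids.

    When passing input_embeds, transformers' processors only see generated ids;
    when passing input_ids, the prompt ids are penalized as well, so sampled
    outputs can diverge even with identical sampling settings.
    """
    scores = logits.clone()
    for tok in set(seen_token_ids):
        if scores[tok] > 0:
            scores[tok] = scores[tok] / penalty
        else:
            scores[tok] = scores[tok] * penalty
    return scores
```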

DefTruth avatar Aug 12 '24 23:08 DefTruth


That only works for a single image, right? What about multiple images? If the prompt contains several <image> placeholders, how do you pass them with the pre + img + post approach?

Hukongtao avatar Aug 13 '24 02:08 Hukongtao


Actually, it would be much simpler if passing input embeddings directly were supported.

DefTruth avatar Aug 13 '24 05:08 DefTruth


Multiple images work too; you can refer to the VILA implementation in trt-llm. But yes, it would indeed be much simpler if you could just pass input embeddings, haha @DefTruth. By the way, what is the "trn" you mentioned?

Oldpan avatar Aug 13 '24 06:08 Oldpan


trn -> transformers; I was being lazy and abbreviated it.

DefTruth avatar Aug 13 '24 08:08 DefTruth

@Oldpan I got internvl2-2B running, but inference always generates up to max_token. Why is that?

qism avatar Aug 15 '24 02:08 qism

My guess is that end_id is not set correctly.

Oldpan avatar Aug 15 '24 10:08 Oldpan

@Oldpan @qism I met the same issue in my own Llama setup. I passed the eos_id to the runner.generate function, but it still generates tokens until it reaches max_new_tokens.

zengrh3 avatar Aug 15 '24 12:08 zengrh3

My guess is that end_id is not set correctly.

I guess your guess is right.

DefTruth avatar Aug 20 '24 10:08 DefTruth

end_id = 2 is correct. After testing, run.py --stop_words "<|im_end|>" solves it.

qism avatar Aug 21 '24 07:08 qism


We modified qwen2 on top of 0.11.0 to add multimodal support; it works in our tests, so see whether it helps you. Changes: https://github.com/bnuzhanyu/trtllm-mmodal/pull/1 The core idea is to additionally pass a bs * seq_len * hidden_size mmodal_embedding matrix, plus weighting masks. The hidden_state finally given to the transformer is hidden_state = input_mask * word_emb + mmodal_mask * mmodal_embedding. The input mask and mmodal mask can be set as 0/1 float16 matrices by inspecting the input_ids according to your use case (a rough sketch follows the list below).

It has the following limitations:

  1. Only the Python runtime can be used; there is currently no way to adapt this to the tritonserver trtllm backend.
  2. Only ordinary token ids can be generated; generating the multimodal-specific token_ids is not supported.
  3. The current changes probably cannot be used with speculative decoding or beam search (untested).
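A minimal sketch of the combination described above (shapes, dtypes, and the placeholder-token convention are assumptions based on the description, not the actual patch):

```python
import torch

def combine_embeddings(input_ids, word_emb, mmodal_embedding, image_token_id):
    """hidden_state = input_mask * word_emb + mmodal_mask * mmodal_embedding

    word_emb:         (bs, seq_len, hidden) embedded input_ids (float16)
    mmodal_embedding: (bs, seq_len, hidden) extra multimodal embedding input
    The 0/1 float16 masks are derived from input_ids: positions holding image
    placeholder tokens take the multimodal embedding, all others keep word_emb.
    """
    mmodal_mask = (input_ids == image_token_id).to(torch.float16).unsqueeze(-1)
    input_mask = 1.0 - mmodal_mask
    return input_mask * word_emb + mmodal_mask * mmodal_embedding
```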

bnuzhanyu avatar Aug 26 '24 05:08 bnuzhanyu


Awesome!

Hukongtao avatar Aug 26 '24 06:08 Hukongtao

input_embeds cannot be accessed directly. prompt_table should be used to pass visual features as input. The specific position of the visual features within the prompt varies from model to model.

For multiple images, see https://github.com/NVIDIA/TensorRT-LLM/issues/2144#issuecomment-2330175706
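For the multi-image case, the same fake-prompt-id idea extends by giving each <image> occurrence its own range of out-of-vocabulary ids and concatenating the features into one prompt table (a sketch under assumed names; see the linked comment for the supported workflow):

```python
import torch

def build_multi_image_ids(text_chunks, per_image_embeds, vocab_size):
    """text_chunks: N+1 lists of token ids surrounding N <image> placeholders.
    per_image_embeds: list of N tensors, each (num_tokens_i, hidden).

    Returns input_ids where every image span is a distinct run of ids >= vocab_size,
    plus the concatenated prompt table those ids index into inside the engine.
    """
    ids, tables, next_id = [], [], vocab_size
    for i, chunk in enumerate(text_chunks):
        ids += list(chunk)
        if i < len(per_image_embeds):
            n = per_image_embeds[i].shape[0]
            ids += list(range(next_id, next_id + n))
            next_id += n
            tables.append(per_image_embeds[i])
    return torch.tensor([ids], dtype=torch.int32), torch.cat(tables, dim=0)
```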

amukkara avatar Sep 05 '24 01:09 amukkara

@amukkara So is there any way to pass the input_embeds into TensorRT-LLM directly?

OswaldoBornemann avatar Sep 10 '24 06:09 OswaldoBornemann

Hello, I can now run the language part of InternVL2 with trt-llm, but can the image part also be accelerated with trt-llm?

scuizhibin avatar Sep 24 '24 10:09 scuizhibin

Is there any plan to support this requirement? It seems that there are many related application scenarios. @byshiue

Hukongtao avatar Sep 26 '24 07:09 Hukongtao

How can input_embeds be passed into the model for inference?

scuizhibin avatar Oct 23 '24 07:10 scuizhibin

Building the prompt as pre + img + post

What does "building the prompt as pre + img + post" mean?

scuizhibin avatar Oct 23 '24 07:10 scuizhibin

@Hukongtao If you have no further questions, we will close it in a week.

hello-11 avatar Nov 14 '24 02:11 hello-11

My question is whether this feature will be supported in the future.

Hukongtao avatar Nov 14 '24 03:11 Hukongtao