
Does TensorRT-LLM support passing input_embeds directly?

Open Hukongtao opened this issue 1 year ago • 11 comments

For multimodal models, we usually need to combine the visual features with the text embeddings into a final input_embeds tensor and send it to the model for inference. Currently, this combination can differ between multimodal models. For example, for InternVL2: https://huggingface.co/OpenGVLab/InternVL2-26B/blob/72496452c5525ba579fdd87d62bb958bfa59020e/modeling_internvl_chat.py#L318-L334 Therefore, can TensorRT-LLM support directly passing the final input_embeds? Although TensorRT-LLM provides the prompt_table parameter, in some cases prompt_table cannot meet our needs.
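For context, HF-style multimodal models typically perform this merge themselves before the LLM forward pass. A minimal sketch of that pattern (illustrative names only, not a TensorRT-LLM API):

```python
import torch

def merge_visual_embeds(input_ids, text_embeds, visual_embeds, image_token_id):
    """Replace the placeholder rows of text_embeds with visual features.

    input_ids:     (batch, seq_len) token ids containing <image> placeholder ids
    text_embeds:   (batch, seq_len, hidden) output of the word embedding layer
    visual_embeds: (num_image_tokens, hidden) flattened vision-encoder features,
                   one row per placeholder position, in order
    """
    input_embeds = text_embeds.clone()
    mask = input_ids == image_token_id              # (batch, seq_len) bool
    # Scatter the visual features into the placeholder positions.
    input_embeds[mask] = visual_embeds.to(input_embeds.dtype)
    return input_embeds
```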

Hukongtao avatar Aug 09 '24 08:08 Hukongtao

same question

DefTruth avatar Aug 12 '24 05:08 DefTruth

same question

Wow, it's you! I even follow you on Zhihu.

Hukongtao avatar Aug 12 '24 07:08 Hukongtao

I'm also curious how input_embeds could be passed directly; I'm not sure whether your specific need for passing input_embeds directly is the same as mine. That said, InternVL2 can be run with trt-llm by building the prompt as pre + img + post. The token ids are determined before they go into trt-llm; when they are actually fed to the trt-llm decoder engine, they are passed in together with the image's visual features. Inside the decoder engine, the input_ids are embedded and then concatenated with the visual features, so this can be implemented.
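A rough sketch of this pre + img + post idea, assuming the usual TensorRT-LLM prompt-tuning convention where token ids at or above vocab_size are looked up in the prompt table instead of the word-embedding table (names and shapes here are assumptions, not the exact API):

```python
import torch

def build_ptuning_inputs(tokenizer, pre_text, post_text, visual_embeds, vocab_size):
    """Build input_ids where the image span is a run of out-of-vocabulary
    'virtual' ids, plus the prompt table holding the visual features.

    visual_embeds: (num_image_tokens, hidden) features from the vision encoder
    """
    pre_ids = tokenizer(pre_text, add_special_tokens=False).input_ids
    post_ids = tokenizer(post_text, add_special_tokens=False).input_ids

    num_img = visual_embeds.shape[0]
    # Ids >= vocab_size index into the prompt table inside the decoder engine.
    img_ids = list(range(vocab_size, vocab_size + num_img))

    input_ids = torch.tensor([pre_ids + img_ids + post_ids], dtype=torch.int32)
    prompt_table = visual_embeds  # passed to the decoder engine alongside input_ids
    return input_ids, prompt_table
```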

Oldpan avatar Aug 12 '24 14:08 Oldpan

Yes, that is the only implementation available at the moment. But this approach still has a problem when passing input embeddings: when you need to use a penalty such as repetition_penalty, transformers only considers the output ids for the penalty, whereas trtllm runs inference from input ids and effectively applies the penalty over input ids + output ids. As a result, with the penalty enabled on both sides, the outputs cannot be aligned with trn's.
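To illustrate the mismatch being described, here is a minimal sketch of a standard repetition penalty; the only difference between the two setups is which token ids end up in seen_token_ids (generated ids only vs. prompt ids + generated ids):

```python
import torch

def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Penalize the logits of every token id that already appears in seen_token_ids.

    When passing input_embeds, transformers' processors only see generated ids;
    when passing input_ids, the prompt ids are penalized as well, so sampled
    outputs can diverge even with identical sampling settings.
    """
    scores = logits.clone()
    for tok in set(seen_token_ids):
        if scores[tok] > 0:
            scores[tok] = scores[tok] / penalty
        else:
            scores[tok] = scores[tok] * penalty
    return scores
```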

DefTruth avatar Aug 12 '24 23:08 DefTruth


That only works for a single image, right? What about multiple images? If the prompt contains several <image> placeholders, how do you pass them with the pre + img + post approach?

Hukongtao avatar Aug 13 '24 02:08 Hukongtao


Actually, it would be much simpler if passing input embeddings directly were supported.

DefTruth avatar Aug 13 '24 05:08 DefTruth


Multiple images work too; you can refer to the VILA implementation in trt-llm. But yes, it would indeed be much simpler if you could just pass input embeddings, haha @DefTruth. By the way, what is the "trn" you mentioned?

Oldpan avatar Aug 13 '24 06:08 Oldpan


trn -> transformers; I was being lazy and abbreviated it.

DefTruth avatar Aug 13 '24 08:08 DefTruth

@Oldpan I got internvl2-2B running, but inference always generates up to max_token. Why is that?

qism avatar Aug 15 '24 02:08 qism

My guess is that end_id is not set correctly.

Oldpan avatar Aug 15 '24 10:08 Oldpan

@Oldpan @qism I met the same issue in my own Llama setup. I passed the eos_id to the runner.generate function, but it still generates tokens until it reaches max_new_tokens.

zengrh3 avatar Aug 15 '24 12:08 zengrh3

My guess is that end_id is not set correctly.

I guess your guess is right.

DefTruth avatar Aug 20 '24 10:08 DefTruth

end_id = 2 is correct. After testing, run.py --stop_words "<|im_end|>" solves it.

qism avatar Aug 21 '24 07:08 qism


We modified qwen2 on top of 0.11.0 to add multimodal support; it works in our tests, so see whether it helps you. Changes: https://github.com/bnuzhanyu/trtllm-mmodal/pull/1 The core idea is to additionally pass a bs * seq_len * hidden_size mmodal_embedding matrix, plus weighting masks. The hidden_state finally given to the transformer is hidden_state = input_mask * word_emb + mmodal_mask * mmodal_embedding. The input mask and mmodal mask can be set as 0/1 float16 matrices by inspecting the input_ids according to your use case (a rough sketch follows the list below).

It has the following limitations:

  1. Only the Python runtime can be used; there is currently no way to adapt this to the tritonserver trtllm backend.
  2. Only ordinary token ids can be generated; generating the multimodal-specific token_ids is not supported.
  3. The current changes probably cannot be used with speculative decoding or beam search (untested).
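A minimal sketch of the combination described above (shapes, dtypes, and the placeholder-token convention are assumptions based on the description, not the actual patch):

```python
import torch

def combine_embeddings(input_ids, word_emb, mmodal_embedding, image_token_id):
    """hidden_state = input_mask * word_emb + mmodal_mask * mmodal_embedding

    word_emb:         (bs, seq_len, hidden) embedded input_ids (float16)
    mmodal_embedding: (bs, seq_len, hidden) extra multimodal embedding input
    The 0/1 float16 masks are derived from input_ids: positions holding image
    placeholder tokens take the multimodal embedding, all others keep word_emb.
    """
    mmodal_mask = (input_ids == image_token_id).to(torch.float16).unsqueeze(-1)
    input_mask = 1.0 - mmodal_mask
    return input_mask * word_emb + mmodal_mask * mmodal_embedding
```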

bnuzhanyu avatar Aug 26 '24 05:08 bnuzhanyu


Awesome!

Hukongtao avatar Aug 26 '24 06:08 Hukongtao

input_embeds cannot be accessed directly. prompt_table should be used to pass visual features as input. The specific position of the visual features within the prompt varies from model to model.

For multiple images, see https://github.com/NVIDIA/TensorRT-LLM/issues/2144#issuecomment-2330175706
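For the multi-image case, the same fake-prompt-id idea extends by giving each <image> occurrence its own range of out-of-vocabulary ids and concatenating the features into one prompt table (a sketch under assumed names; see the linked comment for the supported workflow):

```python
import torch

def build_multi_image_ids(text_chunks, per_image_embeds, vocab_size):
    """text_chunks: N+1 lists of token ids surrounding N <image> placeholders.
    per_image_embeds: list of N tensors, each (num_tokens_i, hidden).

    Returns input_ids where every image span is a distinct run of ids >= vocab_size,
    plus the concatenated prompt table those ids index into inside the engine.
    """
    ids, tables, next_id = [], [], vocab_size
    for i, chunk in enumerate(text_chunks):
        ids += list(chunk)
        if i < len(per_image_embeds):
            n = per_image_embeds[i].shape[0]
            ids += list(range(next_id, next_id + n))
            next_id += n
            tables.append(per_image_embeds[i])
    return torch.tensor([ids], dtype=torch.int32), torch.cat(tables, dim=0)
```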

amukkara avatar Sep 05 '24 01:09 amukkara

@amukkara So is there any way to pass the input_embeds into TensorRT-LLM directly?

OswaldoBornemann avatar Sep 10 '24 06:09 OswaldoBornemann

Hello, I can now run the language part of InternVL2 with trt-llm, but can the image part also be accelerated with trt-llm?

scuizhibin avatar Sep 24 '24 10:09 scuizhibin

Is there any plan to support this requirement? It seems that there are many related application scenarios. @byshiue

Hukongtao avatar Sep 26 '24 07:09 Hukongtao

How can input_embeds be passed into the model for inference?

scuizhibin avatar Oct 23 '24 07:10 scuizhibin

Building the prompt as pre + img + post

What does "building the prompt as pre + img + post" mean?

scuizhibin avatar Oct 23 '24 07:10 scuizhibin

@Hukongtao If you have no further questions, we will close it in a week.

hello-11 avatar Nov 14 '24 02:11 hello-11

My question is whether this feature will be supported in the future.

Hukongtao avatar Nov 14 '24 03:11 Hukongtao