AmazDeng

16 issues by AmazDeng

### Question
Hi, authors, thank you for your great contribution. I've noticed that during the pretraining phase, the `preprocess_plain` method is used. This method discards the question part and directly concatenates...
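For context, the concatenation pattern described above can be sketched as follows. This is a minimal reconstruction of the idea only; `DEFAULT_IMAGE_TOKEN` and the `value` field names are borrowed from the LLaVA codebase, and the separator is an assumption:

```python
# Minimal sketch of the preprocess_plain idea: the human turn is reduced to
# just the image token, so pretraining only supervises the caption/answer.
DEFAULT_IMAGE_TOKEN = "<image>"
SEP = "\n"  # assumption; the real code appends the conversation template's sep

def preprocess_plain(sources):
    conversations = []
    for source in sources:
        human_turn, gpt_turn = source[0], source[1]
        # Discard the question text entirely, keeping only the image token.
        human_turn["value"] = DEFAULT_IMAGE_TOKEN
        conversations.append(human_turn["value"] + gpt_turn["value"] + SEP)
    return conversations

# Usage: each source is a two-turn (human, gpt) conversation.
print(preprocess_plain([[{"value": "<image>\nWhat is this?"}, {"value": "A cat."}]]))
# -> ['<image>A cat.\n']
```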

### Your current environment
```
Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6...
```

bug

### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version....

Hello developers, thank you for your outstanding work. Could you please provide the training hyperparameters used to train the LLaVA-NeXT-Video and LLaVA-NeXT-Video-DPO models?

## Description
I compiled the image part of the open_clip model (a PyTorch model, https://github.com/mlfoundations/open_clip) in a Python environment using TensorRT 8.6.1 and obtained an engine. Then, I developed a service...
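As a point of reference for this build step, a minimal engine-build sketch with the TensorRT 8.6 Python API could look like the following; the ONNX file name, the input tensor name `image`, the shapes, and the FP16 flag are all assumptions, not the reporter's actual settings:

```python
import tensorrt as trt

# Hedged sketch: build a TensorRT 8.6 engine from an ONNX export of the
# open_clip image tower. File names and shapes below are placeholders.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("visual.onnx", "rb") as f:  # assumed ONNX export of the image part
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # assumed precision choice

# Dynamic batch profile; "image" is an assumed input tensor name.
profile = builder.create_optimization_profile()
profile.set_shape("image", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("visual.engine", "wb") as f:
    f.write(engine_bytes)
```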

LLaVA-NeXT-Image and LLaVA-NeXT-Video are fairly good multimodal models, and they are already supported in transformers. I would like to know whether tensorrt-llm plans to support these two models. https://github.com/LLaVA-VL/LLaVA-NeXT https://huggingface.co/docs/transformers/model_doc/llava_next...

new model
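For reference, the existing transformers support mentioned above can be exercised as follows; the checkpoint ID, image URL, and prompt format come from the linked llava_next documentation and are illustrative:

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# One of the LLaVA-NeXT checkpoints already supported in transformers.
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```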

I tested the batch inference results of the llava and llava-next-video models using tensorrt-llm, based on the `examples/multimodal/run.py` file. The parameters for their `generate` method are the same, as follows....

question
waiting for feedback
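The actual parameter values are truncated above; purely to illustrate the kind of `generate` call being compared, a sketch against the `ModelRunner` API from `tensorrt_llm.runtime` might look like this. The engine path, token IDs, and every sampling value below are placeholders, not the reporter's settings:

```python
import torch
from tensorrt_llm.runtime import ModelRunner

# Placeholder engine directory and input ids; not the reporter's settings.
runner = ModelRunner.from_dir(engine_dir="./llm_engine")
batch_input_ids = [
    torch.tensor([1, 15043], dtype=torch.int32),
    torch.tensor([1, 3575, 29901], dtype=torch.int32),
]

# Common sampling knobs; with top_k=1 this is effectively greedy decoding.
outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=128,
    end_id=2,
    pad_id=2,
    temperature=1.0,
    top_k=1,
    top_p=0.0,
    num_beams=1,
)
# outputs holds the generated token ids for each batch entry.
```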

I looked at the model card introduction but didn't see what the main differences are between these two models. Could the author explain?

For the llava-onevision model, the official video inference code does not modify the `image_aspect_ratio` parameter, resulting in the use of the default `anyres_max_9`. This causes the `image_features` to occupy a...
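A sketch of the override implied here, assuming the `overwrite_config` mechanism from the LLaVA-NeXT examples; the key name and the cheaper setting are assumptions and may vary across releases:

```python
from llava.model.builder import load_pretrained_model

# Force a cheaper aspect-ratio mode instead of the default "anyres_max_9".
# "pad" is an assumed lighter setting; check the release you are running.
overwrite_config = {"image_aspect_ratio": "pad"}
tokenizer, model, image_processor, max_length = load_pretrained_model(
    "lmms-lab/llava-onevision-qwen2-7b-ov",  # illustrative checkpoint
    None,
    "llava_qwen",
    device_map="auto",
    overwrite_config=overwrite_config,
)
```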

For the first version of the llava-next-video project, the model chosen was LLaVA-NeXT-Video-7B-DPO. If the number of frames is set to 32, the final `inputs_embeds` dimension sent to the llama2...
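To make that dimension concrete, here is back-of-the-envelope arithmetic under the usual LLaVA-NeXT-Video assumptions (CLIP-ViT-L/14 at 336 px, so a 24×24 patch grid per frame, average-pooled 2×2 before the projector); the real sequence length also includes the text tokens:

```python
# Assumed vision settings; verify against the checkpoint's actual config.
frames = 32
patches_per_side = 336 // 14            # 24 patches per side
pooled_side = patches_per_side // 2     # 12 after 2x2 spatial pooling
tokens_per_frame = pooled_side ** 2     # 144
video_tokens = frames * tokens_per_frame
print(video_tokens)  # 4608 visual embeddings concatenated into inputs_embeds
```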