LLaVA-NeXT Only output [1, 2] tokens for 'lmms-lab/LLaVA-NeXT-Video-7B-DPO' video demo inference

Open LeonLIU08 opened this issue 1 year ago • 2 comments

The value of output_ids is tensor([[1, 2]], device='cuda:0'). The rest of the demo script's output is:

Question: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Please provide a detailed description of the video, focusing on the main subjects, their actions, and the background scenes ASSISTANT:

Response:

Jun 05 '24 11:06 LeonLIU08

Could you please tell me the command you used?

Jun 05 '24 14:06 ZhangYuanhan-AI

The command: bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-7B-DPO vicuna_v1 32 2 True xxx.mp4

By the way, I found that using pool_stride=4 solves this: with pool_stride=2 the input token length is 4673, which exceeds the LLM's max_length (4096).
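The token arithmetic behind this can be checked with a rough sketch. It is not taken from the LLaVA-NeXT code base: the function names are hypothetical, and it assumes each frame yields a 24x24 patch grid (as a CLIP ViT-L/14 encoder at 336 px would) that spatial pooling shrinks by pool_stride on each side. Under that assumption, 32 frames at stride 2 give 4608 visual tokens, which would leave about 65 tokens for the text prompt in the reported total of 4673. That figure is inferred, not measured.

```python
# Rough estimate of the prompt length for video inference.
# Assumptions (not confirmed by the thread): 24x24 patches per frame,
# pooling divides each grid side by pool_stride, ~65 text tokens.

def visual_tokens(num_frames: int, pool_stride: int, grid_side: int = 24) -> int:
    """Tokens contributed by all frames after spatial pooling."""
    pooled_side = grid_side // pool_stride  # exact for strides 2 and 4
    return num_frames * pooled_side * pooled_side


def fits_context(num_frames: int, pool_stride: int, text_tokens: int,
                 max_length: int = 4096) -> tuple[int, bool]:
    """Return (total input tokens, whether they fit within max_length)."""
    total = visual_tokens(num_frames, pool_stride) + text_tokens
    return total, total <= max_length


for stride in (2, 4):
    total, ok = fits_context(num_frames=32, pool_stride=stride, text_tokens=65)
    print(f"pool_stride={stride}: {total} tokens, fits in 4096: {ok}")
```

If the estimate holds, reducing the number of frames instead of raising the stride should also bring the input under the limit, for example 28 frames at stride 2.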

Jun 06 '24 02:06 LeonLIU08
