VILA
Abnormal Inference Time and Repetitive Summary with Efficient-Large-Model/LongVILA-R1-7B on Specific Video Chunk
When using LongVILA-R1 for video summarization, I encountered an issue where one video chunk took an abnormally long time to process and produced a very long summary with significant repetition.
Model: LongVILA-R1-7B
Input Prompt: "Concise summary of the video."
Video: A video containing a traffic accident scene.
Length: 80 seconds
Resolution: 1920x1080
FPS: 20
Chunking Settings:
Chunk Size: 10 seconds
This produced 8 chunks, each 10 seconds long.
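For reference, the chunking described above amounts to splitting an 80-second video into fixed 10-second windows. A minimal sketch of the boundary computation (the function name is made up for illustration; the actual chunking tool is not shown in this report):

```python
def chunk_boundaries(duration_s: float, chunk_s: float):
    """Return (start, end) second offsets covering the video in fixed-size chunks."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds

# 80-second video, 10-second chunks -> 8 chunks, as in the report
print(chunk_boundaries(80, 10))
```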
Observed Behavior:
1. Chunk 3 exhibited an exceptionally high inference time of 523.76 seconds, whereas the other chunks averaged around 6 seconds.
2. The summary generated for Chunk 3 was excessively long and contained numerous repeated sentences, failing to provide the concise summary requested by the prompt.
This suggests a potential issue where the model gets stuck in a loop or encounters a specific type of content in a video chunk that causes a performance bottleneck and output generation failure.
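One quick way to confirm the looping behavior is to count repeated sentences in the chunk's summary. A rough sketch (naive period-based sentence splitting; reading `summary.txt` is left out, and the sample string below is made up):

```python
from collections import Counter

def repeated_sentences(text: str, min_repeats: int = 2):
    """Return sentences that occur at least min_repeats times in a summary."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    counts = Counter(sentences)
    return {s: n for s, n in counts.items() if n >= min_repeats}

sample = "A car stops. A car stops. A car stops. A truck passes."
print(repeated_sentences(sample))  # {'A car stops': 3}
```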
Per-Chunk Inference Log
Chunk 1: VLM inference time = 6.69 seconds
Chunk 2: VLM inference time = 5.09 seconds
Chunk 3: VLM inference time = 523.76 seconds
Chunk 4: VLM inference time = 6.52 seconds
Chunk 5: VLM inference time = 6.52 seconds
Chunk 6: VLM inference time = 5.29 seconds
Chunk 7: VLM inference time = 6.43 seconds
Chunk 8: VLM inference time = 4.35 seconds
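The log above can be screened automatically: chunk 3 is a clear outlier against the median of the per-chunk times (the 10x-median threshold below is an arbitrary choice for illustration):

```python
import statistics

# Per-chunk VLM inference times (seconds), copied from the log above
times = {1: 6.69, 2: 5.09, 3: 523.76, 4: 6.52, 5: 6.52, 6: 5.29, 7: 6.43, 8: 4.35}

median_t = statistics.median(times.values())
outliers = [chunk for chunk, t in times.items() if t > 10 * median_t]
print(outliers)  # [3]
```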
Please see the attached summary.txt, which contains the responses for all 8 chunks.
How can this issue be solved, and why is it occurring?
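For what it's worth, runaway repetitive generation like this is commonly mitigated by capping the generation length and penalizing repetition. The options below are standard Hugging Face `generate()` kwargs; whether LongVILA-R1's inference script exposes them directly is an assumption on my part, and the specific values are only illustrative:

```python
# Hedged sketch: standard Hugging Face generation settings that typically
# curb repetitive looping. Pass them through to model.generate(...) if the
# inference script allows it (an assumption, not confirmed for LongVILA-R1).
gen_kwargs = {
    "max_new_tokens": 256,       # hard cap so a stuck chunk cannot run for minutes
    "repetition_penalty": 1.2,   # discourage re-emitting recently generated tokens
    "no_repeat_ngram_size": 4,   # forbid exact 4-gram repeats
    "do_sample": False,          # keep decoding deterministic
}
# e.g. model.generate(**inputs, **gen_kwargs)
```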
How long is the video for chunk 3? It seems that video is much longer than the others.
Video Length: 80 seconds (each chunk contains only a 10-second clip)
Resolution: 1920x1080
FPS: 20
This is strange; I haven't seen this issue before. Can you post the issue at https://github.com/NVlabs/Long-RL/issues and tag @yukang2017? He knows more about the inference details.