VILA
Abnormal Inference Time and Repetitive Summary with Efficient-Large-Model/LongVILA-R1-7B on Specific Video Chunk
When using LongVILA-R1 for video summarization, I encountered an issue where one video chunk took an abnormally long time to process and produced a very long summary with significant repetition.
Model: LongVILA-R1-7B
Input Prompt: "Concise summary of the video."
Video: A video containing a traffic accident scene.
Length: 80 seconds
Resolution: 1920x1080
FPS: 20
Chunking Settings:
Chunk Size: 10 seconds
This produced 8 chunks, each 10 seconds long.
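For reference, the chunking described above amounts to splitting an 80-second video into fixed 10-second windows. A minimal sketch of the boundary computation (the function name is made up for illustration; the actual chunking tool is not shown in this report):

```python
def chunk_boundaries(duration_s: float, chunk_s: float):
    """Return (start, end) second offsets covering the video in fixed-size chunks."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds

# 80-second video, 10-second chunks -> 8 chunks, as in the report
print(chunk_boundaries(80, 10))
```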
Observed Behavior:
1. Chunk 3 exhibited an exceptionally high inference time of 523.76 seconds, whereas the other chunks averaged around 6 seconds.
2. The summary generated for Chunk 3 was excessively long and contained numerous repeated sentences, failing to provide the concise summary requested by the prompt.
This suggests a potential issue where the model gets stuck in a loop or encounters a specific type of content in a video chunk that causes a performance bottleneck and output generation failure.
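One quick way to confirm the looping behavior is to count repeated sentences in the chunk's summary. A rough sketch (naive period-based sentence splitting; reading `summary.txt` is left out, and the sample string below is made up):

```python
from collections import Counter

def repeated_sentences(text: str, min_repeats: int = 2):
    """Return sentences that occur at least min_repeats times in a summary."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    counts = Counter(sentences)
    return {s: n for s, n in counts.items() if n >= min_repeats}

sample = "A car stops. A car stops. A car stops. A truck passes."
print(repeated_sentences(sample))  # {'A car stops': 3}
```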
Per-Chunk Inference Log
Chunk 1: VLM inference time = 6.69 seconds
Chunk 2: VLM inference time = 5.09 seconds
Chunk 3: VLM inference time = 523.76 seconds
Chunk 4: VLM inference time = 6.52 seconds
Chunk 5: VLM inference time = 6.52 seconds
Chunk 6: VLM inference time = 5.29 seconds
Chunk 7: VLM inference time = 6.43 seconds
Chunk 8: VLM inference time = 4.35 seconds
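The log above can be screened automatically: chunk 3 is a clear outlier against the median of the per-chunk times (the 10x-median threshold below is an arbitrary choice for illustration):

```python
import statistics

# Per-chunk VLM inference times (seconds), copied from the log above
times = {1: 6.69, 2: 5.09, 3: 523.76, 4: 6.52, 5: 6.52, 6: 5.29, 7: 6.43, 8: 4.35}

median_t = statistics.median(times.values())
outliers = [chunk for chunk, t in times.items() if t > 10 * median_t]
print(outliers)  # [3]
```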
Please see the attached summary.txt, which contains the responses for all 8 chunks.
How can this issue be solved, and why is it occurring?
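For what it's worth, runaway repetitive generation like this is commonly mitigated by capping the generation length and penalizing repetition. The options below are standard Hugging Face `generate()` kwargs; whether LongVILA-R1's inference script exposes them directly is an assumption on my part, and the specific values are only illustrative:

```python
# Hedged sketch: standard Hugging Face generation settings that typically
# curb repetitive looping. Pass them through to model.generate(...) if the
# inference script allows it (an assumption, not confirmed for LongVILA-R1).
gen_kwargs = {
    "max_new_tokens": 256,       # hard cap so a stuck chunk cannot run for minutes
    "repetition_penalty": 1.2,   # discourage re-emitting recently generated tokens
    "no_repeat_ngram_size": 4,   # forbid exact 4-gram repeats
    "do_sample": False,          # keep decoding deterministic
}
# e.g. model.generate(**inputs, **gen_kwargs)
```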
How long is the video for chunk 3? It seems that video is much longer than the others.
Video Length: 80 seconds (each chunk contains only a 10-second clip)
Resolution: 1920x1080
FPS: 20
This is strange; I haven't seen this issue before. Can you post the issue at https://github.com/NVlabs/Long-RL/issues and tag @yukang2017? He knows more about the inference details.